Ich bin ein Esel, und will getreu,
Theme:
Bayesian Philosophy of Science 


Among the greatest achievements of science are the laws, models and theories
it has come up with. Newton’s laws of mechanics, Bohr’s model
of the atom and Einstein’s Theory of Relativity gave us unprecedented 
insights into the nature of physical reality. Similarly, Mendel’s laws of 
inheritance, Darwin’s theory of natural selection, and Crick and Watson’s 
innovations in molecular biology elucidated how species develop and how 
traits and properties are passed on from one generation to the next. Pi- 
oneers of rational choice theory like Von Neumann, Savage and Arrow 
explained behavior in terms of beliefs and preferences and came up with 
powerful axioms for rational decision-making under uncertainty. 

This enumeration is not intended to diminish the value of experimental
work in science. Rather, we would like to motivate why theories
and models are central for understanding phenomena, predicting future 
events and transferring scientific knowledge to other domains. Therefore 
the assessment of scientific theories and models is a central part of sci- 
entific reasoning. 

This book investigates how Bayesian inference can contribute to this 
goal. While we do not want to claim that scientific reasoning is essentially 
Bayesian, we claim that Bayesian models can elucidate diverse aspects of 
scientific reasoning, increasing our understanding of how science works 
and why it is so successful. The book is written as a cycle of variations
on this theme; it applies Bayesian inference to eleven different aspects of 
scientific reasoning. 

In this introduction, we explain the constitutive principles and philosophical
foundations of Bayesian inference, as well as some particular reasoning
techniques (e.g., Bayesian networks) that we need in the remainder
of the book. Then we describe our methodological approach—Bayesian 
philosophy of science—in slightly more detail. The level is introductory; 
no knowledge of calculus or higher mathematics is required. 


The Static Dimension of Bayesian Inference: 
Probability and Degrees of Belief 


In science as well as in ordinary life, we routinely make a distinction be- 
tween more and less credible hypotheses. Consider, for example, the ques- 
tion which nation will win the 2016 European Football Cup. We may con- 
sider Albania a very unlikely candidate, England not completely implausi- 
ble, and we may find it likely that the winner will be either France or Ger- 
many. This example illustrates that the epistemic standing of a hypothesis 
is no all-or-nothing affair, but a matter of gradation. Here the Bayesians 
step in: they use the concept of degree of belief to describe epistemic at- 
titudes about uncertain propositions, and they represent these degrees of 
belief by a particular mathematical structure: probability functions. These 
two modeling assumptions are the central elements of Bayesian inference. 
In other words, Bayesians regard probabilities as expressions of subjective 
uncertainty—an interpretation that goes back to the English mathemati- 
cian, reverend and philosopher Thomas Bayes (1701-1761) (Bayes, 1763). 
The probability calculus has a long history as a tool for handling sub- 
jective uncertainty. In particular, it is one of the dominant paradigms in the 
psychology of human reasoning (e.g., Oaksford and Chater, 2000). Other 
scientific applications abound, such as Bayesian inference in phylogenet- 
ics, Bayesian interpretations of quantum mechanics, statistical inference 
and causal induction (e.g., Bernardo and Smith, 1994; Spirtes et al., 2000). 
Bayesian reasoning is also widely applied in philosophy: it is a standard 
tool in various branches of epistemology (e.g., Bovens and Hartmann, 
2003; Pettigrew, 2015) and in the foundations of decision theory and ratio- 
nal choice (e.g., Jeffrey, 1971; Savage, 1972). The prominence of Bayesian 
inference in established scientific and philosophical theories recommends 
it as the default model for modeling uncertain reasoning. We do not claim 
that it is on foundational grounds superior to alternatives such as rank- 
ing functions (Spohn, 2012) or Dempster-Shafer theory (Shafer, 1976): we 
only claim that the past successes and the wide scope of Bayesian models
recommend them as an excellent tool for studying scientific reasoning.
Practical considerations, such as the simplicity of the probability axioms 
and the existence of a well-developed mathematical theory underneath 
them, support the case for Bayesian philosophy of science, too. 


The distinctive feature of Bayesian inference is the central role of degree 
of belief. More traditional descriptions of epistemic attitudes, e.g., theories 
that just distinguish between belief, disbelief and suspension of judgment, 
struggle to describe graded epistemic attitudes adequately. They thus fail to
account for many cases of scientific reasoning where we do hold graded
beliefs. But psychological realism is not the only force that pulls in
the direction of a graded theory of epistemic attitudes. An all-or-nothing
account of epistemic attitudes also leads into philosophical trouble, such as 
in the famous lottery paradox (Kyburg, 1961). For instance, if just one out
of 1,000,000 tickets in a lottery is winning, then for each single ticket #i
(i ∈ {1, 2, ..., 1,000,000}), we seem to be entitled to believe that it is not
the winning one. However, we also believe that there is a winning ticket 
in the lottery. Hence, the set of the propositions that we believe (“There is 
a winning ticket”, “ticket #1 does not win”, “ticket #2 does not win”, etc.) 
is inconsistent, which is at least prima facie an undesirable consequence. 
For theories of graded belief, no such inconsistency arises (for discussion 
of the full vs. graded belief relationship, see Leitgeb, 2014; Fitelson et al., 
2016). 


If we want to use the concept of degree of belief to describe which 
scientific theories are more credible than others, we have to say something 
about the rules that rational degrees of belief have to satisfy. After all, 
the objects of degrees of belief—propositions—are related to each other in 
manifold ways, and these relations constrain the set of degrees of belief 
that we can rationally entertain. For example, it seems that we cannot 
believe the proposition A ∧ B to a higher degree than the proposition A, since
A ∧ B cannot be true without A being true. For Bayesians, the probability
calculus captures all relevant constraints that rational degrees of belief 
over a set of logically interconnected propositions have to satisfy. 


Let L be a propositional language and let 𝒜 be a σ-algebra over well-formed
formulae of L: that is, a set of propositions that contains the tautology
and the contradiction and is closed under logical negation and under countable
disjunction of propositions. This method of construction ensures that the
algebra contains all truth-functional compounds of propositions which are
already in the algebra.

A probability function p : 𝒜 → [0,1] operates on such an algebra of
propositions and takes values in the unit interval [0,1]. That is, if we as- 
sign degrees of belief to two propositions, we also have degrees of belief in 
their truth-functional compounds. More precisely, truth-functional opera- 
tors affect the assignment of degrees of belief according to the following 
axioms of probability (Kolmogorov, 1933): 


Probability Function For a propositional language L with a σ-algebra 𝒜, p :
𝒜 → [0,1] is called a probability function if and only if it satisfies the
following three properties:


1. p(⊤) = 1.

2. p(¬A) = 1 − p(A).

3. For mutually exclusive propositions A₁, A₂, A₃, ...:

$$p\left(\bigvee_{n \in \mathbb{N}} A_n\right) = \sum_{n=1}^{\infty} p(A_n) \tag{1}$$


In this model, degrees of belief correspond to numbers in the inter- 
val [0,1], where zero denotes minimal and one denotes maximal degree 
of belief. It is not hard to motivate the three above constraints for ra- 
tional agents: Each tautology is assigned maximal degree of belief. If A 
is strongly believed, its negation ¬A is weakly believed, and vice versa.
In particular, the degrees of belief in A and ¬A add up to unity. Finally,
the degree of belief in the disjunction of mutually exclusive propositions
corresponds to the sum of the degrees of belief in the individual propositions.
This can be understood as summing up the weight of the possible worlds where
A₁, A₂, A₃
etc. obtain. Prima facie, these constraints capture our everyday use of the 
word “probable”. 
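
The possible-worlds reading of the axioms can be made concrete in a small sketch. Everything in it is an illustrative assumption rather than part of the formal framework: the two atomic propositions, the four worlds they generate, and the particular weights. Degrees of belief are computed as the total weight of the worlds in which a proposition holds, and only the finite case of the third axiom is checked.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy setup: two atomic propositions A and B generate four
# possible worlds, each represented as a pair of truth values (A, B).
WORLDS = list(product([True, False], repeat=2))

# Assumed weights of the worlds: non-negative numbers that sum to one.
WEIGHTS = {
    (True, True): Fraction(1, 2),
    (True, False): Fraction(1, 4),
    (False, True): Fraction(1, 8),
    (False, False): Fraction(1, 8),
}


def p(proposition):
    """Degree of belief: total weight of the worlds where the proposition holds."""
    return sum(WEIGHTS[w] for w in WORLDS if proposition(*w))


def tautology(a, b):
    return True


def prop_A(a, b):
    return a


def not_A(a, b):
    return not a


def A_and_B(a, b):
    return a and b


def A_and_not_B(a, b):
    return a and not b


def disjunction(a, b):
    # Disjunction of the two mutually exclusive propositions above.
    return A_and_B(a, b) or A_and_not_B(a, b)


print(p(tautology) == 1)                                   # first axiom
print(p(not_A) == 1 - p(prop_A))                           # second axiom
print(p(disjunction) == p(A_and_B) + p(A_and_not_B))       # additivity, finite case
print(p(A_and_B) <= p(prop_A))                             # a conjunction is never more credible than a conjunct
```

Any other assignment of non-negative weights summing to one would pass the same checks, which is why the weights, not the axioms, carry the agent's particular opinions.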

It is notable that the third condition uses an infinite instead of a finite 
sum. Indeed, this choice is controversial in the literature. There is a sub- 
stantial debate about whether probabilistic degrees of belief should satisfy 
this condition of countable additivity instead of the weaker requirement 
of finite additivity: p(A) + p(B) = p(A ∨ B) for two mutually exclusive
propositions A and B. Several authors argue that accepting countable ad- 
ditivity amounts to making substantial and unwarranted epistemological 
assumptions (de Finetti, 1972, 1974; Kelly, 1996; Howson, 2008). Jaynes 
(2003) responds that countable additivity naturally flows from a proper
mathematical modeling of uncertainty. Kadane et al. (1999) discuss fur- 
ther consequences of choosing finitely instead of countably additive prob- 
ability functions. Fortunately, this choice does not make a difference for 
most applications in this book. Since countable additivity is standardly 
assumed in statistical inference, which is one of the focal points of this 
book (Variations 9-11), we take all probability functions to be countably 
additive. 

Following Hailperin (1984, 1996) and Popper (2002), we conceptualize
probability functions as operating on a σ-algebra of propositions. The
alternative to this sentential approach consists in Kolmogorov’s
measure-theoretic approach: probabilities operate on σ-algebras of sets, and the
objects of degrees of belief correspond to the epistemic possibilities that an
agent considers (e.g., Easwaran, 2011a). In this context, one usually speaks 
of probability measures; our use of the term “probability function” indicates
that we are not considering probability in the general set-theoretic
sense. The sentential interpretation strikes us as more natural, simpler 
and closer to the purpose of this book, namely to model the assessment of 
scientific theories. As we will see, it is also well suited for justifying why 
degrees of belief should be probabilistic. 

Let us get back to the three probability axioms. While their qualitative 
motivation is highly plausible, their quantitative form is harder to justify. 
Why should rational degrees of belief satisfy these peculiar axioms rather 
than another set of axioms with identical qualitative properties? Three 
types of arguments have been proposed as an answer: 


1. Dutch Book arguments, associated with the names of Ramsey, de
Finetti, and Jeffrey;


2. Decision-theoretic arguments à la Savage and von
Neumann/Morgenstern;


3. Epistemic arguments due to Cox, Joyce, Pettigrew, and others.


It is important to mention upfront that all these arguments contain 
some form of idealization: the rational agents who are supposed to con- 
form to the axioms of probability are ideally rational agents, that is, agents 
who are immune to trivial reasoning fallacies that real agents commit from
time to time. Instead of claiming that the degrees of belief of real agents
conform to the probability axioms, the arguments below aim to show that
ideally rational agents should have probabilistically coherent degrees of be- 
lief. After all, philosophical theories of degrees of belief are normative in 
the first place. They provide a logic of uncertain reasoning in the same 
sense that propositional logic does for classical deductive reasoning. 

We begin with the famous Dutch Book Arguments. Frank Ramsey 
(1926) observed that many of our actions are based on our degrees of 
belief. He regarded human action as a kind of betting, similar to accepting 
a bet on the next European football champion: 


[...] all our lives we are in a sense betting. Whenever we go to
the station we are betting that a train will really run, and if we 
had not a sufficient degree of belief in this we should decline 
the bet and stay at home. (Ramsey, 1926, 85) 


The most pervasive examples of the link between belief and betting are
perhaps transactions on financial markets—traders buy and sell stocks, 
certificates and options, according to their degrees of belief that these will 
rise and fall. Someone with a high degree of belief that an option will 
become worthless will sell it more eagerly, and for a lower price, than 
someone who is convinced that it will increase in value. 

To distinguish rational from irrational degrees of belief, Ramsey uses 
the instrumental, economic conception of rationality and focuses on a par- 
ticular type of belief-guided action: bets. If we accept and decline bets 
according to our degrees of belief, only probabilistic degrees of belief will 
avoid a sure loss of money, or so Ramsey argues. To this end, Ramsey 
defines degree of belief in the following way: having a degree of belief 
p(A) = x means that we consider a bet with stake €x fair if it pays €1 
if A is true, and nothing if A is false. This definition can be operational- 
ized further: having a degree of belief p(A) = x implies that the agent is 
indifferent between taking the role of the bettor and the bookie in a bet 
on proposition A with betting odds 1/p(A). This technique resembles the 
famous veil of ignorance for disclosing judgments about the fair distribu- 
tion of goods in a society (Rawls, 1971): the agent does not know whether 
she will end up as the bettor or as the bookie. As an alternative proposal, 
consider a fully dispositional, behaviorist definition of degrees of belief 
(e.g., de Finetti, 1937): agent S’s degree of belief in proposition A is equal 
to p if and only if p is the price at which S would sell or buy a bet on A 
that pays €1 if A occurred.

Ramsey then continues that no system of degrees of belief can be fair if
it gives rise to a system of bets (combining the roles of bettor and bookie) 
that implies a sure loss for the agent. Such a system of bets is called a 
Dutch book. By establishing an isomorphism between bets and degrees 
of belief, Ramsey grounds the famous Dutch Book Argument: he demon- 
strates that degrees of belief that violate the axioms of probability will 
give rise to Dutch Books. Conversely, all probabilistic systems of degrees 
of belief are immune to Dutch books. Hence, Ramsey infers that only 
probabilistic degrees of belief are rational (see also Kemeny, 1955). 
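
A minimal sketch of how such a Dutch book arises, under the operationalization given above: a degree of belief p(A) = x is read as the price at which the agent regards a ticket paying €1 if A is true as fair. The specific numbers and the helper name `net_payoff` are illustrative assumptions. The incoherent agent violates the second axiom because her degrees of belief in A and ¬A sum to more than one.

```python
from fractions import Fraction


def net_payoff(prices, truth_values):
    """Net result of buying one EUR 1 ticket on each proposition at the agent's price."""
    cost = sum(prices.values())
    winnings = sum(1 for prop in prices if truth_values[prop])
    return winnings - cost


# Hypothetical degrees of belief, used as fair ticket prices.
incoherent = {"A": Fraction(3, 5), "not A": Fraction(3, 5)}  # sum exceeds 1
coherent = {"A": Fraction(3, 5), "not A": Fraction(2, 5)}    # sum equals 1

# Exactly one of A and not-A is true in each possible world.
worlds = [
    {"A": True, "not A": False},
    {"A": False, "not A": True},
]

incoherent_results = [net_payoff(incoherent, w) for w in worlds]
coherent_results = [net_payoff(coherent, w) for w in worlds]

print(incoherent_results)  # the same negative amount in both worlds
print(coherent_results)    # zero in both worlds
```

The incoherent agent pays €1.20 for two tickets of which exactly one pays €1, so she loses €0.20 whatever happens. Had her degrees of belief summed to less than one, the same loss would result from her acting as bookie on both tickets.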


The cogency of such arguments has been debated in various places 
(e.g., de Finetti, 1972; Howson, 2008; Hájek and Hartmann, 2010; Hartmann
and Sprenger, 2010; Easwaran, 2011a,b). For the sake of simplicity,
suppose a fully behaviorist interpretation of degrees of belief. The Dutch 
Book Theorem assumes that the agent accepts all bets where the proposed 
odds are higher than her personal odds (viz., degrees of belief), and is 
ready to act as bookie on all bets where the proposed odds are lower than 
her personal odds. Real agents, however, are often risk-averse and the 
stake may influence their willingness to take a side in the bet. In other 
words, an agent’s degree of belief in a proposition may depend on the 
amount of money that she has to bet. Moreover, the agent may be unwill- 
ing to engage in any bet if the stakes are high enough, and be willing to 
suffer a Dutch book if the stakes are low enough. None of these behav- 
iors strikes us as blatantly irrational unless we presuppose what is to be 
shown: that rationality equals immunity to Dutch Books. 


These objections suggest that a straightforward operationalization of
degrees of belief in terms of betting behavior, or of fairness judgments about
bets, is problematic. With a crumbling link between degrees of belief and
fair betting odds, the Dutch Book Argument loses a part of its normative
force. Sure, in many situations we can still argue for a strong dependency 
between degrees of belief and fair betting odds, but it might go too far to 
identify both concepts with each other. For this reason, we now move to 
the second, decision-theoretic argument in favor of the thesis that degrees 
of belief should be probabilistic. 


The idea of this argument is that probabilistic degrees of belief can 
represent the epistemic state of an agent who bases her choices on rational 
preferences. First, a number of axioms are imposed on rational preferences,
represented by the binary relation ≼. For example, it is usually
assumed that such preferences are transitive: if an agent prefers apples to
bananas and bananas to cherries, then she will also prefer apples to cherries.
Similarly, it is often assumed that such preferences are complete: that is,
the agent either strictly prefers one of two options or is indifferent
between them.

In his 1954 landmark book “The Foundations of Statistics”, Leonard J.
Savage sets up an entire system of such axioms, called P1-P7 (for an ac- 
cessible introduction, see Karni, 2005). They contain transitivity and com- 
pleteness as well as more demanding axioms, such as the Sure Thing Prin- 
ciple: preference between two acts merely depends on their consequences 
in those states of the world where they have different payoffs. 

Savage then proceeds to prove his famous representation theorem.
If the preferences of an agent X satisfy the axioms P1-P7, there is a prob- 
ability function p and a real-valued utility function u (unique up to affine 
transformation) such that for any two acts f and g, with respect to a state 
space S: 


$$f \preceq g \;\Leftrightarrow\; \int u(f(s))\,dp(s) \leq \int u(g(s))\,dp(s)$$


In other words, act g is weakly preferred to act f if and only if the expected
utility of g, relative to the agent’s subjective degrees of belief, is at least
as large as the expected utility of f. We can thus represent a rational agent
as maximizing the subjective expected utility of her actions.
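
For a finite state space the integral reduces to a weighted sum, which the following sketch uses to compare two acts. The states, consequences, probabilities and utilities are all hypothetical numbers chosen for illustration, and the function name `expected_utility` is likewise an assumption.

```python
from fractions import Fraction


def expected_utility(act, prob, utility):
    """Sum of u(act(s)) weighted by the subjective probability of each state s."""
    return sum(prob[s] * utility[act[s]] for s in prob)


# Hypothetical subjective degrees of belief over two states of the world.
prob = {"rain": Fraction(3, 10), "no rain": Fraction(7, 10)}

# Hypothetical utilities of the possible consequences.
utility = {
    "soaked": Fraction(0),
    "dry, carrying umbrella": Fraction(4, 5),
    "dry, hands free": Fraction(1),
}

# An act maps each state to a consequence.
take_umbrella = {"rain": "dry, carrying umbrella", "no rain": "dry, carrying umbrella"}
leave_umbrella = {"rain": "soaked", "no rain": "dry, hands free"}

eu_take = expected_utility(take_umbrella, prob, utility)
eu_leave = expected_utility(leave_umbrella, prob, utility)

print(eu_take, eu_leave)
print(eu_leave <= eu_take)  # True: leaving the umbrella is weakly dispreferred
```

With these numbers the expected utilities are 4/5 and 7/10, so an agent represented by this pair of probability and utility functions weakly prefers taking the umbrella.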

Savage’s approach has been very influential in economics, and it 
bridges the gap between epistemology and decision theory in an attrac- 
tive and elegant way. However, it is not without drawbacks. First, the 
probability function p(·) describing the agent’s degrees of belief is not
unique by itself: it is only jointly unique together with the utility func- 
tion. This weakens the appeal of Savage’s results for models of scientific 
reasoning, where pragmatic utility considerations are often thought to be 
secondary to pursuit of truth. Second, Savage’s axioms on rational pref- 
erences are not all equally compelling. For instance, Maurice Allais (1953) 
and Daniel Ellsberg (1961, 2001) conducted influential experiments that 
challenged one of Savage’s axioms, the Sure-Thing Principle (see also Al- 
lais and Hagen, 1979). Both results have been replicated over and over 
again. Hence, the decision-theoretic justification of probabilistic degrees 
of belief also fails to be conclusive. 

All this prompts the question of whether there can be a purely epistemic
argument for the probabilistic nature of degrees of belief, free of
operationalist and decision-theoretic considerations. The first attempt along
these lines was made by the physicist Richard Cox in 1946, using the word
plausibility instead of degree of belief. He demonstrated that any real-valued
function p(·) representing the plausibility of a proposition is isomorphic
to a probability function if the following two assumptions (plus
minor technical requirements) are satisfied:


Complementarity There is a decreasing function f : ℝ → ℝ such that
p(¬A) = f(p(A)).


Compositionality There is a function g : ℝ × ℝ → ℝ such that
p(A ∧ B) = g(p(A), p(B|A)), where p(B|A) denotes the plausibility
of B if we already know A.


In other words, if (i) the plausibility of the negation of a proposition is
a decreasing function of the plausibility of the proposition, and (ii) the
plausibility of A ∧ B is determined by the plausibility of A and of B given A,
then plausibility measures obey the mathematical structure of probability,
at least up to isomorphism (Cox, 1946). Here p(B|A) denotes the
degree of belief in B if we suppose that A is true, that is, if we take A as given.
While Complementarity is uncontroversial, the real philosophical issue is
whether the plausibility of A ∧ B is indeed a mere function of the plausibility
of A and the plausibility of B if we suppose A. We will say more on
this in the next section. 

In more recent times, the epistemic approach to justifying probabilistic 
degrees of belief has been resuscitated from a different perspective. James 
Joyce (1998, 2009) has made major contributions to this research project, 
based on earlier work by Brier (1950) and Rosenkrantz (1981). As a cri- 
terion for the rationality of a system of degrees of belief, Joyce evaluates 
their inaccuracy: that is, he compares our degrees of belief p(A) to the actual
truth values of the believed propositions, r(A) ∈ {0,1}. The conventional
measure of inaccuracy is the Brier score taken over all propositions
in a σ-algebra 𝒜:

$$B = \sum_{A \in \mathcal{A}} \left(p(A) - r(A)\right)^2 \tag{2}$$

Generally, degrees of belief match the truth or falsity of the believed
propositions quite well in some possible worlds (namely, in those where
probable propositions are true) and less well in others (namely, in those where
probable propositions are false). Joyce shows that if a system of degrees
of belief does not satisfy the probability axioms, then there is a probability
function that is no less accurate in every possible world, and more accurate
in some, as measured by the Brier score. That is, belief functions
that violate the axioms of probability are dominated by probabilistic belief 
functions (Joyce, 1998). 
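
A small numerical illustration of accuracy dominance may help. It is a sketch under simplifying assumptions: the Brier score is taken only over the two propositions A and ¬A rather than over a whole algebra, and the credence values are hypothetical. The coherent alternative used here is obtained by subtracting the same amount from both incoherent credences so that they sum to one; this is one convenient choice, not a construction taken from Joyce's proof.

```python
from fractions import Fraction


def brier(credence, truth):
    """Sum of squared distances between degrees of belief and truth values."""
    return sum((credence[prop] - truth[prop]) ** 2 for prop in credence)


# Hypothetical degrees of belief in A and its negation.
incoherent = {"A": Fraction(7, 10), "not A": Fraction(6, 10)}    # sum is 13/10
coherent = {"A": Fraction(11, 20), "not A": Fraction(9, 20)}     # sum is 1

# The two possible worlds, given as truth values (1 = true, 0 = false).
worlds = {
    "A is true": {"A": 1, "not A": 0},
    "A is false": {"A": 0, "not A": 1},
}

scores = {
    name: (brier(incoherent, truth), brier(coherent, truth))
    for name, truth in worlds.items()
}

for name, (bad, good) in scores.items():
    print(name, float(bad), float(good))
```

In both worlds the coherent credences receive the lower Brier score (0.405 against 0.45 if A is true, 0.605 against 0.65 if A is false), so the incoherent credences are dominated in this example.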

Like the other two types of arguments and Cox’s variant of the epistemic
argument, Joyce’s epistemic inaccuracy argument is not without controversial
assumptions: Hájek (2008) points out that one of Joyce’s assumptions
amounts to the converse of the Dutch Book Theorem, namely
that probabilistic belief functions cannot be dominated accuracy-wise (see
Maher, 2002, for further criticism). Moreover, the results are sensitive to 
the choice of the scoring rule (here: the Brier score) for calculating the inac- 
curacy of a belief function. Leitgeb and Pettigrew (2010a,b) and Pettigrew 
(2016) develop this research program further. 

We have seen that none of the above three arguments succeeds at pro- 
viding a watertight justification for why degrees of belief should obey the 
axioms of probability. However, they go a long way along this road. It 
is also notable that the same result is reached from completely different 
perspectives and methodological approaches. This suggests that the dif- 
ferent attempts provide substantial cumulative justification for modeling 
degrees of belief as probabilities. Moreover, given the fact that probabil- 
ity is a very specific mathematical structure, it is perhaps not surprising 
that we have to make substantial and sometimes contentious assumptions 
in order to obtain a unique representation of degree of belief. We will 
now proceed to the next concept which plays a central role in Bayesian 
inference: conditional probability. 


Conditional Degree of Belief and Bayes’ Theorem 


The previous section has motivated why the degrees of belief of a rational 
agent at a particular time should satisfy the axioms of probability. What 
about the dynamics of these degrees of belief? How should they change in 
the light of incoming evidence? To answer this question, Bayesians make 
use of the concept of conditional degree of belief, which we have also 
seen in Cox’s representation theorem above: the rational degree of belief 
in a scientific hypothesis H after learning evidence E is the conditional de- 
gree of belief in H given E, mathematically represented by the conditional 
probability p(H|E).

Before we can say more about the dynamics of Bayesian inference, we 
have to clarify what the concept of conditional degree of belief is about, 
and why conditional probability is the right explication of this concept. 
Conditional degree of belief captures the idea that we sometimes judge 
the plausibility of a proposition B in the light of another proposition A. 
For example, what is our degree of belief that Real Madrid will win the 
next Champions League if we suppose that Cristiano Ronaldo is injured 
for a period of six months? What is our degree of belief that crop yields
will decrease by more than 50% if we suppose that there is a drought this
summer? What is our degree of belief that there is an intelligent form of 
life outside the solar system if we make an assumption on the number of 
terrestrial planets in the galaxy? And so on. 

In other words, we adopt the counterfactual or suppositional inter- 
pretation of conditional degree of belief. This interpretation actually 
goes back to Frank P. Ramsey (1926) and Bruno de Finetti (1937). Here is 
Ramsey’s famous analysis of conditional degrees of belief: 


If two people are arguing ‘if H will E?’ and both are in doubt as 
to H, they are adding H hypothetically to their stock of knowl- 
edge and arguing on that basis about E. (Ramsey, 1926) 


The above quote is ambiguous: is it about conditional (degree of) belief 
or about the truth or acceptability conditions for indicative conditionals? 
The next sentence clarifies Ramsey’s project: 


We can say that they are fixing their degrees of belief in E given H. 
(ibid., our emphasis) 


This makes clear that regardless of the possible link to the epistemol- 
ogy of conditionals, Ramsey intended that hypothetically assuming H 
would determine one’s conditional degrees of belief in E, given H—see 
also de Finetti (1972, 2008). 

Ramsey’s analysis creates a direct link between conditional degree of 
belief and statistical reasoning. For instance, statisticians describe the 
probability of observing k heads in N independently and identically
distributed (i.i.d.) tosses of a fair coin by the Binomial probability density
function $B_{N,1/2}(k) = \binom{N}{k} (1/2)^N$. By Ramsey’s definition, our conditional
degrees of belief in observing a certain number of heads or tails follow
these probability densities: this is just what it means to suppose that the
coin is fair and that the tosses have been i.i.d. By supposing the fairness
of the coin and the i.i.d.-ness of the tosses, we fix our degrees of belief
in observing two heads in three tosses at $\binom{3}{2} (1/2)^3 = 3/8$. That is, when
interpreted counterfactually, conditional degrees of belief follow objective 
probabilities in statistical reasoning. 
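
The coin example can be checked directly. The sketch below assumes nothing beyond the Binomial formula; the function name and the optional bias parameter `theta` are illustrative additions, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def binomial_probability(k, n, theta=Fraction(1, 2)):
    """Probability of k heads in n i.i.d. tosses of a coin with heads-probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


two_heads_in_three = binomial_probability(2, 3)
print(two_heads_in_three)  # 3/8

# The conditional degrees of belief over all possible head counts sum to one.
total = sum(binomial_probability(k, 3) for k in range(4))
print(total)  # 1
```

Supposing a different bias, for instance `theta=Fraction(2, 3)`, fixes a different family of conditional degrees of belief in the same way.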

Due to the conditional, and sometimes outright counterfactual nature 
of these statements, the concept of conditional degree of belief is qual- 
itatively different from the ordinary concept of belief. So why should 
conditional degrees of belief satisfy the axioms of probability, and how do 
they relate to unconditional degrees of belief? 

To answer the first question, consider degrees of belief in the propositions
B₁, B₂, etc., all of them conditional on another proposition A. Then
the same arguments as in the previous section apply (or, from the point 
of view of a sceptic, don’t apply). After all, the Dutch Book argument in 
favor of probabilistic representation of degree of belief does not make a 
difference between whether or not a proposition A is presupposed. Similarly
for the decision-theoretic and the epistemic arguments: all of them
retain their normative force if applied to degrees of belief conditional on
proposition A and the probabilistic representation p(·|A). Supposing a
proposition A creates a new family of degrees of belief, together with a
probability function p(·|A) that describes their coherence.

But how do these probability functions square with the unconditional 
probabilities p(·)? Standardly, the conditional probability of an event E
given H is defined as the ratio of the probability of the conjunction of both
events to the probability of H (assuming p(H) > 0):

$$p(E|H) = \frac{p(E \wedge H)}{p(H)} \tag{Ratio Analysis}$$

This move is primarily motivated from the mathematical development of 
the theory of probability by Kolmogorov (1933), but it can also be justified 
by a Dutch Book argument. Imagine that all the probabilities in Ratio 
Analysis represent the degrees of belief of a rational agent and correspond 
to a system of associated bets. The bet on the conditional event E given H, 
abbreviated E|H, is a so-called conditional bet: it is a regular bet on E if H 
turns out to be true, and the bet is called off—that is, the stake is returned 
to the bettor with no further consequences—if H is false (de Finetti, 1937). 
It can then be shown that failure to comply with Ratio Analysis leads to 
a Dutch Book. Consider the system of bets where an agent wagers €x on
H and €y on E|H with betting odds 1/p(H) and 1/p(E|H), while acting
as a bookie for a bet on E ∧ H with stake €z and odds 1/p(E ∧ H). The
odds correspond, of course, to the degrees of belief specified by p(·) and
p(·|H). Assume further that the stakes satisfy the equations x = z and
y = x/p(H). It can then be shown (proof omitted) that this system of bets
is fair if, and only if, Ratio Analysis is satisfied. Otherwise, the agent will 
either be left with a sure loss or with a sure gain: the paradigm case of a 
Dutch book. For these reasons, Ratio Analysis is unanimously accepted as 
a constraint on conditional degrees of belief. 
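
Since the proof is omitted, a numerical sketch of the payoffs may be useful. It rests on an assumed betting convention that the text does not spell out: a bet with stake s at odds 1/q returns s/q in total if it is won, so the bettor nets s/q − s, and the stake is lost otherwise; the bookie's payoff is the mirror image. The function name and the example probabilities are hypothetical.

```python
from fractions import Fraction


def net_payoffs(p_H, p_E_given_H, p_E_and_H, x=Fraction(1)):
    """Agent's net payoff in each state (H, E) for the three bets described above."""
    z = x        # stake of the bet on E and H, where the agent acts as bookie
    y = x / p_H  # stake of the conditional bet on E given H

    payoffs = {}
    for h, e in [(True, True), (True, False), (False, True), (False, False)]:
        # Ordinary bet on H, agent as bettor.
        bet_on_H = x / p_H - x if h else -x

        # Conditional bet on E given H: called off (stake returned) if H is false.
        if not h:
            bet_on_E_given_H = 0
        elif e:
            bet_on_E_given_H = y / p_E_given_H - y
        else:
            bet_on_E_given_H = -y

        # Bet on E and H, agent as bookie: keeps the stake unless E and H both hold.
        bookie_on_E_and_H = z - z / p_E_and_H if (h and e) else z

        payoffs[(h, e)] = bet_on_H + bet_on_E_given_H + bookie_on_E_and_H
    return payoffs


half = Fraction(1, 2)
fair = net_payoffs(half, half, Fraction(1, 4))    # p(E and H) = p(E|H) p(H)
unfair = net_payoffs(half, half, Fraction(1, 5))  # Ratio Analysis violated

print(fair)
print(unfair)
```

Under this convention the net payoff is zero in every state when Ratio Analysis holds. When it is violated, the payoff in the state where E and H are both true becomes nonzero while the others stay at zero, so the package of bets is no longer fair.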

A consequence of Ratio Analysis is Bayes’ famous theorem. If we combine

$$p(E|H) = \frac{p(E \wedge H)}{p(H)}$$

with the cognate equation

$$p(H|E) = \frac{p(E \wedge H)}{p(E)}$$

then we obtain, by a matter of simple substitution, Bayes’ Theorem:

$$p(H|E) = p(H)\,\frac{p(E|H)}{p(E)} \tag{3}$$


This equation will accompany us throughout the book—it describes how 
the degree of belief in H given E relates to the unconditional degrees of 
belief in H and E, and to the conditional degree of belief in E given H. 
Note that it does not describe how agents should change their degrees of 
belief in H when learning E: it relates the probability function p(·), which
represents the agent’s unconditional degrees of belief, to the probability
functions p(·|H) and p(·|E), which represent the agent’s conditional degrees
of belief. Acting as such an epistemic coordination principle is the
philosophical significance of Bayes’ theorem. 

It is also possible to write the right hand side of Bayes’ Theorem 
slightly differently: 


$$p(H|E) = \left(1 + \frac{p(\neg H)}{p(H)} \cdot \frac{p(E|\neg H)}{p(E|H)}\right)^{-1} \tag{4}$$


In this formulation, the dependency of p(H|E) on p(E) is replaced by a
dependency on p(E|¬H). In many cases of scientific and in particular
statistical inference, the latter quantity can be accessed and calculated
more easily than p(E). We will make use of both (3) and (4) frequently
throughout the book.
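
The step from (3) to (4) is not spelled out in the text. A short derivation, assuming p(H) > 0 and p(E|H) > 0 and expanding p(E) by the law of total probability, might run as follows:

```latex
\begin{align*}
p(H|E) &= \frac{p(H)\,p(E|H)}{p(E)}
        = \frac{p(H)\,p(E|H)}{p(H)\,p(E|H) + p(\neg H)\,p(E|\neg H)} \\
       &= \left(1 + \frac{p(\neg H)}{p(H)} \cdot \frac{p(E|\neg H)}{p(E|H)}\right)^{-1}
\end{align*}
```

The last step divides numerator and denominator by p(H) p(E|H), which is where the two positivity assumptions are needed.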

We would also like to discuss a tempting proposal, namely to read
Ratio Analysis as a definition of conditional probability, rather than a
mathematical constraint. In fact, this is a road taken by many textbooks
(e.g., Earman, 1992; Skyrms, 2000; Howson and Urbach, 2006). Transferred to
conditional degree of belief, this would mean that the conditional degree
of belief in E given H is just the ratio of the unconditional degrees of
belief in E ∧ H and H. In this case, Bayes’ Theorem is indeed a theorem of
mathematics without philosophical import.

In our view, a conditional probability that is reduced to unconditional
probability would have trouble describing conditional degrees of belief.
First of all, the conditional probability of E given H cannot be calculated 
when the unconditional probability of H, p(H), is zero. But intuitively, 
such conditional probabilities make sense when H is an idealized, but 
almost certainly false hypothesis (e.g., “this particular coin is fair”, “these 
two random variables have the same variance”, etc.). Similarly, intuitively 
meaningful questions such as “What is the probability that a point on 
Earth is in the Western hemisphere (H), given that it lies on the equator 
(E)?” cannot be answered if Ratio Analysis is an exhaustive analysis of 
conditional probability. At least, more technical detail has to be provided. 
Hájek (2003) gathers many such criticisms in order to make a case against
Ratio Analysis as a definition of conditional probability, while Easwaran
(2011c) and Myrvold (2015) explore avenues for parrying Hájek’s criticism.

Second, Ratio Analysis fails to grasp the normative role of conditional 
degree of belief in Bayesian inference. Often, it is part of the meaning of H 
to constrain p(E|H) in a unique way. For determining our rational degree 
of belief that a fair coin yields a particular sequence of heads and tails, it 
does not matter whether the coin in question is actually fair. Regardless of 
our degree of belief in that proposition, we all agree that the probability 
of two heads in three tosses is 3/8 if we suppose that the coin is fair. This 
sentence has a distinctly analytical flavor whereas the degrees of belief 
in E ∧ H and H are prima facie unrestricted. If Ratio Analysis is all that
we can say about conditional probability and conditional degree of belief, 
then this feature of conditional degree of belief drops out of the picture 
(see also Edgington, 1995). 


Third, as a matter of psychological fact, we do not form conditional 
degrees of belief via the conjunction of both propositions. It is cognitively
very demanding to elicit our degrees of belief in E ∧ H and H, and to
calculate their ratio. Indeed, recent experimental evidence suggests that 
Ratio Analysis is a poor description of how people reason with conditional 
probabilities, pointing out the necessity of finding an alternative account 
(Zhao et al., 2009). 


For all these reasons, Ratio Analysis is not suitable as a definition 
of conditional probability and conditional degree of belief. Arguably, 
the suppositional understanding of conditional probability is also better 
suited for scientific reasoning, e.g., degrees of belief in the outcomes of an 
experiment that is described by a statistical hypothesis. Indeed, several 
mathematicians, epistemologists and philosophers of science have
proposed to understand conditional probability as a primitive concept (Rényi,
1970; Popper, 2002; Hájek, 2003; Maher, 2010). The unconditional
probability of A can then be defined as the probability of A conditional on a
tautological proposition. This move does justice to the intuition that if 
conditional degree of belief cannot be reduced to unconditional degree 
of belief, as we have argued above, then there is actually a set of
probability functions which describe the agent’s epistemic attitudes: p(·|A),
p(·|B), and so on. Ratio Analysis and Bayes’ Theorem are then more than
a mathematical fact about a probability function: they can be interpreted 
as requirements on how conditional and actual degrees of belief, and the 
various probability functions that represent them, should cohere. By
taking conditional probability as a primitive concept, the variety of
probability functions is unified in a single two-place function p(·|·).
Another option consists in unifying conditional and unconditional
probabilities under the umbrella of the concept of a conditional
expectation, and a σ-algebra that is conditional on a random variable. For
this rather technical route, see Gyenis et al. (2016); Rédei and Gyenis (2016).


The suppositional interpretation of conditional probability has far- 
reaching philosophical implications which deserve a detailed treatment, 
but cannot be covered in this book (for discussion, see Sprenger, 2016a). 
As later chapters will show, the suppositional approach allows for fruitful 
applications of Bayesian reasoning to the topics of confirmation, old evi- 
dence, causality, and scientific objectivity. We will now return to the ques- 
tion of how the concept of conditional degree of belief provides Bayesian 
inference with a rule for updating degrees of belief. 

The Dynamic Dimension of Bayesian Inference:
Bayesian Conditionalization 


The previous sections have explained the static dimensions of Bayesian 
inference: representing degrees of belief in terms of probabilities and co- 
ordinating unconditional with conditional degrees of belief. To obtain a 
full-fledged theory of reasoning with degrees of belief, we also need a 
principle that states how degrees of belief are changed in the light of in- 
coming information. This is actually very simple: The rational degree of 
belief in hypothesis H after learning evidence E is expressed by the condi- 
tional probability of H given E. 


Bayesian Conditionalization The rational degree of belief in a
proposition H after learning evidence E, represented by the probability
function p'(·), is the conditional probability of H given E according to
the agent’s original degrees of belief represented by the probability
function p(·): p'(H) = p(H|E).


The principle of Bayesian Conditionalization often figures as a corner- 
stone of Bayesian reasoning (e.g., Earman, 1992, 34). It is inspired by the 
same idea that motivated conditional degrees of belief: when we learn 
a piece of evidence E, we add it to our background knowledge and see 
which consequences this addition has for the rest of our epistemic at- 
titudes. This is why the new degree of belief in H is set equal to the 
conditional probability of H given E. 

By means of Bayes’ Theorem, presented in the previous section, we can 
express Bayesian Conditionalization as follows (see Equations (3) and (4)): 


$$p'(H) = p(H|E) = p(H)\,\frac{p(E|H)}{p(E)} = \left(1 + \frac{p(\neg H)}{p(H)} \cdot \frac{p(E|\neg H)}{p(E|H)}\right)^{-1}$$


In this equation, p(H) and p(H|E) are called the prior probability and
posterior probability of H, while p(E|H) and p(E|¬H) are called the
likelihoods of H and ¬H on E, that is, the probability of the observed
evidence E under a specific hypothesis, in this case H or ¬H. This version
of Bayesian Conditionalization is especially useful for applications in
statistical inference, where the statistical model often provides us with the
likelihood function p(E|H_θ), for hypotheses indexed by some parameter
θ. Bayesian Conditionalization provides a way of learning novel evidence
by means of trading off the prior probability of hypothesis H with the
likelihoods of H and ¬H on the evidence.
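
The trade-off between prior and likelihoods can be made concrete with a small numerical sketch. The function names and the numbers below are hypothetical and serve only as an illustration; the two functions compute the posterior once via Equation (3), with p(E) expanded by the law of total probability, and once via Equation (4).

```python
def posterior_via_total_probability(prior_h, lik_h, lik_not_h):
    """Equation (3), with p(E) = p(H) p(E|H) + p(not-H) p(E|not-H)."""
    p_e = prior_h * lik_h + (1.0 - prior_h) * lik_not_h
    return prior_h * lik_h / p_e


def posterior_via_likelihood_ratio(prior_h, lik_h, lik_not_h):
    """Equation (4): posterior from the prior odds and the likelihood ratio."""
    return 1.0 / (1.0 + ((1.0 - prior_h) / prior_h) * (lik_not_h / lik_h))


# Hypothetical numbers, chosen only for illustration.
prior_h = 0.2      # p(H)
lik_h = 0.9        # p(E|H)
lik_not_h = 0.3    # p(E|not-H)

post_3 = posterior_via_total_probability(prior_h, lik_h, lik_not_h)
post_4 = posterior_via_likelihood_ratio(prior_h, lik_h, lik_not_h)
print(post_3, post_4)  # both forms agree
```

With these values the evidence is three times as likely under H as under ¬H, so the degree of belief in H rises from 0.2 to 3/7.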


Agents who initially have different degrees of belief, represented by
probability functions p₁(·) and p₂(·), are brought closer to each other if
they both follow Bayesian Conditionalization as an updating rule. As long
as they agree on the propositions which obtain measure zero (in other
words, p₁(X) = 0 ⇒ p₂(X) = 0, and vice versa), the distance between
p₁ and p₂ will approach zero as they update on the same growing body of
evidence. In other words, individual differences will eventually cancel
out if both agents are Bayesian conditionalizers. For more discussion of
this literature on the convergence of priors, see Blackwell and
Dubins (1962), Gaifman and Snir (1982) and Earman (1992).
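
A toy illustration of this convergence, not a rendering of the cited theorems: two agents with different priors over three hypothetical coin-bias hypotheses update on the same made-up sequence of tosses. The hypotheses, priors, data and the choice of total variation as the distance measure are all assumptions of this sketch.

```python
def update(prior, likelihoods):
    """Bayesian Conditionalization over a finite set of hypotheses."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical setup: three hypotheses about a coin's bias towards heads.
biases = [0.3, 0.5, 0.7]
agent_1 = [0.8, 0.1, 0.1]   # prior of the first agent
agent_2 = [0.1, 0.2, 0.7]   # prior of the second agent
# Both priors give positive probability to every hypothesis,
# so the agents agree on which propositions have measure zero.

# A fixed, made-up sequence of 200 tosses with 70% heads (1 = heads, 0 = tails).
tosses = ([1] * 7 + [0] * 3) * 20

initial_distance = total_variation(agent_1, agent_2)
for toss in tosses:
    likelihoods = [b if toss == 1 else 1.0 - b for b in biases]
    agent_1 = update(agent_1, likelihoods)
    agent_2 = update(agent_2, likelihoods)
final_distance = total_variation(agent_1, agent_2)

print(initial_distance, final_distance)
```

In this example both agents end up almost certain of the bias 0.7, and the distance between their degrees of belief shrinks from 0.7 to a value close to zero.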


Essentially, Bayesian Conditionalization forges together learning E, as
described by p'(·), and supposing E, as described by p(·|E). However,
in spite of the intuitive similarity between learning E and supposing E, it
is not easy to justify this equality. After all, there are nontrivial
psychological differences: Zhao et al. (2012) found, in a recent experiment,
that participants who learned evidence E (e.g., by observing relative
frequencies) submitted different probability estimates of H than participants
who had to suppose that E occurred. Given this discrepancy on the descriptive
level, we have to offer a convincing normative argument in favor
of Bayesian Conditionalization.


A standard proposal, much like what we have seen before, consists
in dynamic Dutch Book Arguments (Teller, 1973). Consider an agent
who knows in advance that an observable X will take one of the values
x₁, x₂, x₃, ... For instance, X could be the outcome of the toss of a die with
possible results being in the set {1,...,6}. Assume that p(X = x₁) > 0
and that p'(A) < p(A|X = x₁) for some proposition A, where p'(·)
describes the agent’s degrees of belief in case X = x₁ is learned. Assume
further that the agent engages in the following system of bets: she buys a
conditional bet on A given X = x₁, described by the odds 1/p(A|X = x₁),
she buys a bet on X = x₁ with a very small stake, and she will sell a bet
on A if X should happen to take the value x₁. This last bet is described by
p'(A), the agent’s degree of belief in A after learning X = x₁, and its stake
is slightly higher than that of the conditional bet on A. Teller (1973)
showed that such a system of bets leads to a sure loss for the agent: either
she will lose her bet on X = x₁ and the other two bets will be called off,
or she wins this bet, but the gain will be outweighed by the sure loss that
the two other bets yield. See also Easwaran (2011a, 316).

One problem with the dynamic Dutch Book Argument consists in the 
fact that it requires the agent to fix in advance which bets she is going to 
accept in the future if she happens to learn a certain fact about the world. 
In other words, the dynamic Dutch Book Argument is a sanity check for 
the preferences and commitments of an agent, instead of a proof of the 
irrationality of following another updating rule. Moreover, the scope of 
Teller’s argument and its successors (e.g., van Fraassen, 1989; Lewis, 1999) 
is often restricted. 

As a tool for learning from experience, Bayesian Conditionalization is 
also somewhat restricted. For example, it does not describe how we up- 
date our degrees of belief in the light of information whose propositional 
status is unclear, e.g., indicative conditionals. We will address this partic- 
ular challenge in the first variation. Moreover, sometimes we do not learn 
that evidence E has occurred with certainty, just that it is highly likely. For 
instance, a look at the weather forecast may shift our probability distribu- 
tion over E = “it will rain tonight” and its negation from p(E) = 1/2 to 
p'(E) = 9/10. How should our belief in other propositions, such as H = 
“the sun will shine tomorrow” change in the face of such evidence? Of 
course, we could update on the second-order proposition that the proba- 
bility of E has changed, but such a move would involve great complica- 
tions and leave the object language L. And even then, the implications for 
the posterior probability of H would not be clear. 

To solve this challenge, Jeffrey (1971) has argued that the posterior 
probability of hypothesis H after learning E, p’(H), should obey the equa- 
tion 


$$p'(H) = p'(E)\,p(H|E) + p'(\neg E)\,p(H|\neg E) \tag{JC}$$


whenever the following two equations are satisfied: 


$$p(H|E) = p'(H|E) \qquad p(H|\neg E) = p'(H|\neg E) \tag{Rigidity}$$


The first equation computes the new degree of belief in H as the weighted
average of the conditional degrees of belief in H given E and given ¬E,
weighted with the new degrees of belief in E and ¬E. (JC) or Jeffrey
Conditionalization follows from the Law of Total Probability together with
(Rigidity). In a recent paper, Schwan and Stern (2016) argue that (Rigidity)
holds whenever E screens off H from the propositional content D of
the learning experience, that is, p(H|D, ±E) = p(H|±E). Obviously,
Jeffrey Conditionalization reduces to Bayesian Conditionalization when E is
known for certain, that is, when p'(E) = 1.
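
The weather example can be worked through numerically. The shift from p(E) = 1/2 to p'(E) = 9/10 is taken from the text; the conditional degrees of belief p(H|E) and p(H|¬E) are not given there, so the values below are purely hypothetical, as is the function name.

```python
def jeffrey_update(new_p_e, p_h_given_e, p_h_given_not_e):
    """Equation (JC): new degree of belief in H, assuming (Rigidity) holds."""
    return new_p_e * p_h_given_e + (1.0 - new_p_e) * p_h_given_not_e


# Hypothetical conditional degrees of belief (not from the text).
p_h_given_e = 0.3        # p(H|E): sun tomorrow, given rain tonight
p_h_given_not_e = 0.8    # p(H|not-E): sun tomorrow, given no rain tonight

# With the prior p(E) = 1/2, the formula is just the law of total probability.
prior_h = jeffrey_update(0.5, p_h_given_e, p_h_given_not_e)
# After the forecast, p'(E) = 9/10.
new_h = jeffrey_update(0.9, p_h_given_e, p_h_given_not_e)
# With p'(E) = 1, Jeffrey Conditionalization reduces to p(H|E).
certain_h = jeffrey_update(1.0, p_h_given_e, p_h_given_not_e)

print(prior_h, new_h, certain_h)
```

Under these assumed values the degree of belief in H drops from 0.55 to 0.35 after the forecast, and it would drop to p(H|E) = 0.3 if rain were learned with certainty.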


Diaconis and Zabell (1982) have also demonstrated that Bayesian
Conditionalization is just a special case of a more general updating principle,
namely minimizing the Kullback-Leibler divergence between prior and
posterior probability distribution under the constraint p'(E) = 1.
Variation 1 reproduces Diaconis and Zabell’s proof, discusses their approach in
more detail and applies it to learning conditional information. A similar
result can be shown for Jeffrey Conditionalization when p'(E) < 1. That
is, Bayesian learning can be represented as conservative belief revision:
Bayesians change their degrees of belief only in so far as newly learned
constraints on their degrees of belief (e.g., p'(E) = 1) force them to do so.
Or in other words, they stay as close to their prior degrees of belief
as these constraints allow. This principle is also entrenched in
non-quantitative theories of dynamic reasoning, such as the AGM-model
of belief revision (Alchourrón et al., 1985), which operates on the binary
level of belief and disbelief. Coherence with established reasoning models
in other domains of science and philosophy may be seen as a distinct plus
for Bayesian inference.
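
The minimization claim can be checked numerically on a small example. This is a sanity check under stated assumptions, not a proof: the four-world prior is made up, and the divergence is taken in the direction from the candidate posterior to the prior, i.e. the sum over worlds of p'(w) log(p'(w)/p(w)).

```python
import math
import random


def kl_divergence(q, p):
    """Sum over worlds of q(w) * log(q(w) / p(w)), with 0 * log 0 read as 0."""
    return sum(qw * math.log(qw / pw) for qw, pw in zip(q, p) if qw > 0)


# Hypothetical prior over four possible worlds; E is true in worlds 0 and 1.
prior = [0.1, 0.3, 0.2, 0.4]
in_e = [True, True, False, False]

# Posterior obtained by Bayesian Conditionalization on E.
p_e = sum(pw for pw, e in zip(prior, in_e) if e)
conditionalized = [pw / p_e if e else 0.0 for pw, e in zip(prior, in_e)]
kl_conditionalized = kl_divergence(conditionalized, prior)

# Other candidate posteriors that also satisfy the constraint p'(E) = 1.
rng = random.Random(0)
kl_alternatives = []
for _ in range(1000):
    weight = rng.uniform(0.01, 0.99)   # mass on world 0; the rest goes to world 1
    candidate = [weight, 1.0 - weight, 0.0, 0.0]
    kl_alternatives.append(kl_divergence(candidate, prior))

print(kl_conditionalized, min(kl_alternatives))
```

In this example no sampled alternative comes closer to the prior than the conditionalized distribution, whose divergence from the prior equals −log p(E).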


Hence, while it is difficult to give a fully compelling and conclusive jus- 
tification of changing degrees of belief in a particular way, the above argu- 
ments provide a strong cumulative case for Bayesian Conditionalization. 
This is especially interesting since the motivations come from different 
directions: one comes from the operationalist, decision-theoretic corner, 
and another one from a principle of epistemic conservativity which is also 
used in qualitative models of belief revision. 


Let us wrap up by adding a third dimension to Bayesian inference. We 
have talked a lot about the mathematical axioms that govern the statics and 
dynamics of rational degree of belief. But so far, we were silent on how 
degrees of belief should inform rational decisions. In many applications of 
Bayesian reasoning, it is emphasized that the goal of a Bayesian inference 
is the calculation of posterior probabilities. This is also the main result of 
many approaches to rational choice in economics: posterior probabilities 
are combined with subjective utilities in order to make an optimal choice 
(Savage, 1972). Under a certain set of assumptions on rational preferences, 
it can be shown that all relevant information for making rational
decisions (in the economic, instrumental sense of rationality) is contained in
an agent’s posterior probability distribution. This is the action-related di- 
mension of Bayesian inference. We therefore identify the core of Bayesian 
inference as the conjunction of the following principles: 


Static Dimension Rational degrees of belief of an agent are represented 
by probability functions. 


Dynamic Dimension Bayesian Conditionalization (or some generaliza- 
tion thereof) prescribes how a rational agent should revise her de- 
grees of belief. 


Action Dimension The posterior probability distribution is the rational
basis for assessing evidence, accepting hypotheses and making
decisions.


At least one of these principles is endorsed by anyone who calls herself
a Bayesian. However, not all Bayesians, or scientists who apply Bayesian
methods, agree with all three principles (for a survey, see Weisberg, 2009).
Bayesian statisticians sometimes refuse to interpret probabilities in a sub- 
jective sense: objective prior probabilities, which are based on sym- 
metry, invariance or information minimization principles (Jeffreys, 1961; 
Bernardo, 1979a; Vassend, 2016), do not represent any agent’s degrees of 
belief and are supposed to screen off Bayesian inference from the charge of 
arbitrariness. What is more, Bayesian statisticians frequently use improper 
prior distributions which fail to sum up to one and violate the axioms of 
probability (Bernardo and Smith, 1994). However, objective Bayesians typ- 
ically accept the other two principles—Bayesian Conditionalization and 
decision-making based on posterior probabilities. 

Ironically, there are also varieties of objective Bayesian inference that
accept the first, static principle but reject the second, dynamic
principle, that is, Bayesian Conditionalization (Jaynes, 1968, 2003; Williamson,
2007, 2010). Jon Williamson suggests the following approach: rational
degrees of belief should satisfy the axioms of probability as a matter of 
coherence. Moreover, they should be in sync with empirical constraints, 
that is, knowledge about the external world. Such knowledge can consist 
in propositions that the agent has come to know. But it also includes the
expectation and variance of random variables, or other constraints that
are difficult to express in a simple propositional language. For this
reason, Williamson’s approach is particularly apt for statistical reasoning. Of
course, there will typically be more than one probability function which
satisfies these constraints. Williamson recommends choosing the most
equivocal, that is, the most middling distribution. This choice can be
motivated in different ways, one of them including risk aversion. See 
Williamson (2007) and Williamson (2010, Chapter 2 and 3) for founda- 
tional motivation and Seidenfeld (1979, 1986) for two classic criticisms. 

Finally, the link between (posterior) degree of belief and
decision-making need not be strict either. It is possible to reason with
subjective degrees of belief and to change them by Bayesian
Conditionalization, but to make decisions in a different way, e.g., based on
frequentist statistics. Indeed, scientists often admit that Bayesian
inference is a foundationally sound framework for modeling rational degree
of belief, but they prefer to work with frequentist or descriptive
statistics. Reasons for that preference are the difficulty of coming up with
meaningful prior distributions and concerns relating to scientific
objectivity (e.g., US Food and Drug Administration, 2010; Gelman and
Shalizi, 2012; Cumming, 2014; Trafimow and Marks, 2015). Similarly, one may
decide to assess theories on the basis of their confirmational track record,
which may be derived from a Bayesian framework without being equal to a
theory’s posterior probability. A recent representative of such an approach
is Brössel (2016), who has introduced the concept of confirmation
commitments (see also Hawthorne, 2005). More on the notion of confirmation
will be said in Variation 2.

These remarks conclude our discussion of the foundations of Bayesian 
inference. We now introduce a powerful practical tool for making Bayesian
inferences: Bayesian networks. Philosophically, this does not add any
substantial assumptions, but since Bayesian Networks will be one of our main
tools in the remainder of the book, it is useful to explain the principles 
behind them. 


Bayesian Networks 


A Bayesian network is a directed acyclical graph (DAG) which represents
conditional and unconditional probabilistic independencies between
various propositions. It is a useful graphical tool for describing inferential
relations between propositions, or in a causal reading, how events affect
each other. Indeed, Bayesian networks greatly facilitate causal inference in
science (Pearl, 2000; Spirtes et al., 2000).


[Figure: a directed graph with arrows from node R to node A and from node R to node B.]


Figure 0.1: The Bayesian Network for the Risotto Example.


A canonical example of reasoning with a Bayesian Network is given 
in Figure 0.1. R represents the proposition that Alice and Bob have pre- 
pared a risotto from poisonous mushrooms. A represents the proposition 
that Alice has stomach pain after eating the risotto, and B represents the 
proposition that Bob has stomach pain after eating the risotto. Assume 
that there is a probability distribution p(·) over the propositional variables
A ∈ {A, ¬A}, B ∈ {B, ¬B} and R ∈ {R, ¬R}. Throughout the book,
we follow the convention that propositional variables are printed in italic 
script, while their instantiations are printed in roman script (Bovens and 
Hartmann, 2003). The arrows in the graph then correspond to probabilistic 
dependencies between the variables. 

In this graph, A would be called a descendant of R, and R would be a 
parent of A. Likewise for the relation between B and R. Also descendants 
of A and B would be among the descendants of R; however, R would not 
be their parent. Lack of an arrow between two nodes of the network indi- 
cates that the two variables do not depend on each other directly, but only 
via one or several intermediate variables. Conditional on these intermedi- 
ate variables, they are probabilistically independent. 

In general, the relationship between DAGs and probability distribu- 
tions can be formalized as follows: 


Parental Markov Condition The probability distribution p(-) is Markov 
relative to a directed acyclical graph G if and only if every variable 
is probabilistically independent of all its non-descendants in G, con- 
ditional on its parents. 


That is, the probability distribution over A, B and R is Markov relative
to the graph in Figure 0.1 if and only if

$$p(A, B|R) = p(A|R) \cdot p(B|R) \tag{5}$$
$$p(A, B|\neg R) = p(A|\neg R) \cdot p(B|\neg R) \tag{6}$$

where R acts as the parent node of A, and B is a non-descendant of A. Our
shorthand notation for the conditional independence of A and B, given R,
is (A ⊥⊥ B)|R.

This property is plausible for the causal interpretation that we have 
given to the network. If we already know that Alice and Bob ate a poi- 
sonous mushroom risotto, learning about Alice’s stomach pain does not 
raise or lower our probability that Bob has stomach pain. In other words, 
eating the poisonous mushroom risotto is a common cause of the stomach 
pain of both Alice and Bob. When the Bayesian network correctly rep- 
resents the causal relations between different variables, with arrows de- 
noting paths of causal influence, the Parental Markov Condition is trans- 
formed into the philosophically more substantive Causal Markov Con- 
dition: a phenomenon is independent of its noneffects, given its direct 
causes. 

The Parental Markov Condition specifies the constraints that the rela- 
tions between the nodes in the Bayesian Network G place on the proba- 
bility distribution p(-). This allows for easily reading off the conditional 
independencies. Moreover, with the help of the Parental Markov Condi- 
tion, we can calculate joint and marginal probabilities in a straightforward 
way. For instance, in the above example, we can use Equations (5) and (6) 
to write p(A,B,R) as 


$$p(A, B, R) = p(A, B|R)\,p(R) = p(A|R)\,p(B|R)\,p(R) \tag{7}$$


and analogously for all other conjunctions of ±A, ±B, and ±R. Similarly,
the marginal probability of A (and likewise for B) can be written as

$$\begin{aligned}
p(A) &= \sum_{\pm B,\, \pm R} p(A, B, R) \\
     &= \sum_{\pm B,\, \pm R} p(A|R)\,p(B|R)\,p(R)
\end{aligned}$$

where the sum is taken over the different possible values of B and R
(here: true or false). We have used the law of total probability in the first
line and Equation (7) in the second line.

The above equations suggest that a joint or marginal probability can
always be reduced to a combination of probabilities conditional on parent
variables and probabilities of root variables, that is, variables that do not
have parents. Indeed, in general, it will always be the case that for a graph
G with variables {A₁, ..., Aₙ} and a probability distribution p(·) that is
Markov relative to G:

$$p(A_1, \ldots, A_n) = \prod_{i=1}^{n} p(A_i \,|\, \mathrm{Par}(A_i))$$


That is, if we reason about probabilities in a Bayesian network, it suffices 
to know the base rates of the root variables and the conditional probability 
of any variable given its parents. Often, these values are much easier to 
elicit than the joint or marginal probabilities. 
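
The following sketch applies this factorization to the risotto network with arrows from R to A and from R to B. All numerical values are hypothetical, as are the helper names; the point is only that the base rate of R and the conditional probabilities given R suffice to recover joint and marginal probabilities.

```python
from itertools import product

# Hypothetical numbers for the risotto network R -> A, R -> B.
p_r = {True: 0.1, False: 0.9}            # p(R), p(not-R)
p_a_given_r = {True: 0.8, False: 0.05}   # p(A|R), p(A|not-R)
p_b_given_r = {True: 0.7, False: 0.05}   # p(B|R), p(B|not-R)


def bernoulli(p_true, value):
    """Probability that a binary variable takes `value`, given p(True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, r):
    """Equation (7) and its analogues: p(a, b, r) = p(a|r) p(b|r) p(r)."""
    return bernoulli(p_a_given_r[r], a) * bernoulli(p_b_given_r[r], b) * p_r[r]


# Marginal p(A): sum the joint over all values of B and R.
p_a = sum(joint(True, b, r) for b, r in product([True, False], repeat=2))
p_b = sum(joint(a, True, r) for a, r in product([True, False], repeat=2))

# Conditional independence (5): p(A, B|R) equals p(A|R) p(B|R).
p_ab_given_r = joint(True, True, True) / p_r[True]

# Without conditioning on R, A and B are correlated via their common cause.
p_ab = sum(joint(True, True, r) for r in [True, False])

print(p_a, p_b, p_ab_given_r, p_ab, p_a * p_b)
```

With these assumed values, p(A ∧ B) is considerably larger than p(A) p(B): Alice's and Bob's stomach pains are unconditionally correlated, although they are independent given R.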

The scope of applications of Bayesian Networks in science is huge, and 
it goes beyond the scope of this extremely brief introduction to list even 
the most important ones. In fact, our use of Bayesian Networks in this 
book will remain on an elementary level: to represent causal relations 
and conditional independencies between propositional variables, and to 
calculate joint and marginal probabilities in an efficient way. We now 
articulate our view of Bayesian philosophy of science. 


Bayesian Philosophy of Science 


In a classical introduction to Bayesian inference, it is claimed that “sci- 
entific reasoning is essentially reasoning in accordance with the formal 
principles of probability” (Howson and Urbach, 2006, xvii)—see also Ear- 
man (1992, 142). Personally, we find this claim too strong: as an impressive 
number of works in philosophy of science have shown, successful scientific 
reasoning patterns are extremely diverse and vary with the disciplinary 
context where they are applied, as well as with the type of problems they 
address (e.g., Hempel, 1965; Cartwright, 1979; van Fraassen, 1980; Hack- 
ing, 1983). There are also principled reasons why a purely probabilistic 
logic of scientific inference would have to be incomplete (Norton, 2016). 
What we hope to show in this book is that there are some general aspects 
of scientific reasoning that are well captured by Bayesian inference. 

In other words: describing scientists as having degrees of belief which 
are updated by Bayesian Conditionalization helps us to better understand 
how they reason about their theories, why they accept some and reject 
others. In particular, we are interested in Bayesian models of cognitive
values, that is, values that are characteristic of a good scientific theory
(Kuhn, 1977a; McMullin, 1982; Douglas, 2009a, 2013), such as predictive
accuracy, explanatory power and simplicity.

We use a spectrum of Bayesian models across the book, and we do 
so in a straightforwardly eclectic fashion. Some of our models are based 
on Bayesian Conditionalization and others on more general forms of be- 
lief change. Sometimes we import models from Bayesian statistics, some- 
times from the philosophical literature on Bayesian inductive logic. To our 
mind, this is not a problem. After all, we are not interested in defending 
Bayesian inference as the uniquely correct theory of epistemic attitudes, 
but in showing that it is a fruitful one. Hence, does Bayesian inference 
provide unexpected insights into scientific reasoning? Does it solve prob- 
lems that other models struggle with? Does it suggest interesting experi- 
ments or questions for future research? We believe that Bayesian inference 
fares well with respect to those criteria which are also standardly used for 
the evaluation of scientific models (Weisberg, 2007, 2012; Frigg and Hart- 
mann, 2012). However, no single Bayesian model will be able to succeed 
at modeling phenomena as diverse as scientific confirmation, explanation, 
intertheoretic reduction and causal effect. Diverse phenomena ask for a 
diversity of (Bayesian) models, and of course, Bayesian models may often 
have to be complemented by non-Bayesian approaches. 

We understand Bayesian philosophy of science as the use of Bayesian 
principles and methods for modeling scientific reasoning. It involves 
two different goals and methods: on the one hand, we explicate cen- 
tral concepts of scientific reasoning in a Bayesian language; on the other 
hand, we apply Bayesian inference to scientific reasoning, e.g., by
building Bayesian models of the No Alternatives Argument and the No Miracles
Argument.

In the explicative project, we follow Carnap’s methodology of replac- 
ing a vague concept, the explicandum, by an exact one, the explicatum: 


If a concept is given as explicandum, the task consists in 
finding another concept as its explicatum which fulfils the fol- 
lowing requirements to a sufficient degree. 


(1) The explicatum is to be similar to the explicandum in such
a way that, in most cases in which the explicandum has so
far been used, the explicatum can be used; however, close
similarity is not required, and considerable differences are
permitted.

(2) The characterization of the explicatum, that is, the rules of
its use (for instance, in the form of a definition), is to be
given in an exact form, so as to introduce the explicatum
into a well-connected system of scientific concepts.

(3) The explicatum is to be a fruitful concept, that is, useful for
the formulation of many universal statements (empirical
laws in the case of a non-logical concept, logical theorems
in the case of a logical concept).

(4) The explicatum should be as simple as possible; this means
as simple as the more important requirements (1), (2), and
(3) permit. (Carnap, 1950, 7)


In the context of this book, this means that we provide a quantitative di- 
mension for central concepts in scientific reasoning, such as confirmation, 
explanatory power and causal effect. Explication involves a tight intercon- 
nection of conceptual analysis and formal methods: conceptual analysis 
leads us to adequacy conditions which the explicatum has to satisfy, while 
formal reasoning helps us to characterize the set of explicata that satisfy 
these conditions, and to prove existence and uniqueness theorems. 

Bayesian model-building, on the other hand, takes a more applied fo- 
cus. Rather than explicating a particular concept by the axiomatic method 
of adequacy conditions and representation theorems, we identify a set of 
variables that matter for a particular case of scientific reasoning (e.g., the
No Miracles Argument). Then we postulate relations between them and 
we investigate what kind of interesting findings we can make on the basis 
of these assumptions. The fact that our models are framed in terms of 
rational degrees of belief makes them Bayesian. This means that our work 
in this book has a more foundational side, connected to the explicative 
project, and a more applied side, connected to Bayesian model-building. 
Notably, both the explicative and the model-building project make use of 
empirical and computational methods where appropriate: experimental 
findings are evaluated in order to judge the adequacy of an explicatum 
with respect to the concept that it targets, and computational methods are 
used for exploring the consequences of our models in cases where we fail 
to achieve strictly analytical solutions. 

An Outline of the Book


We conclude this introduction with an overview of the chapters of the 
book, which may be thought of as variations on the themes presented in 
this exposition. 


The first five variations center around a common theme: the confirma- 
tion of scientific theories. Scientific theories are valued to the extent that 
they make accurate predictions, and degree of confirmation quantifies the 
extent to which theories have been predictively successful. Measuring de- 
gree of confirmation is a classical task for Bayesian philosophy of science, 
since confirmation can be straightforwardly explicated in terms of increase 
in probability. But our approach is broader: we also address challenges to 
Bayesian Confirmation Theory, and we demonstrate how certain argument 
patterns in science (e.g., the No Alternatives Argument and the No Mira- 
cles Argument) can be recast as confirmatory arguments for the theory in 
question. 


Variation 1 describes how learning conditional information (e.g., if in- 
tervention X is made, result Y will occur) may confirm or disconfirm a 
scientific theory. For instance, how should the belief in a theory T change 
if we learn that it makes a particular prediction (e.g., p(E|T) = 1)? To solve 
this challenge, we use a generalization of Bayesian Conditionalization and 
conceptualize rational degree of belief change as minimizing the diver- 
gence between prior and posterior distribution, conditional on preserving 
the causal and inferential structure of the involved propositions. This
allows us to capture several (counter)examples that haunt other accounts of
learning conditional information.


Variation 2 is devoted to a quantitative analysis of confirmation in 
Bayesian terms, in particular confirmation as increase in firmness: evi- 
dence E confirms theory T if and only if it raises the (subjective) probability 
of T. We motivate and describe the transition from qualitative theories of 
confirmation in first-order logic to quantitative, Bayesian models of
confirmation. We characterize these models and explain their advantages
vis-à-vis qualitative models, especially with respect to classical challenges such
as the paradox of the ravens, the tacking paradox and the grue paradox. 
This brings us to an analysis of the problems of Bayesian confirmation 
theory, such as the measure sensitivity of confirmation-theoretic analysis,
and more generally, the plurality of Bayesian confirmation measures.
We finally discuss how conceptual analysis and empirical evidence can be
combined to narrow down the class of adequate confirmation measures.


Variation 3 discusses one of the major challenges to Bayesian confir- 
mation theory: the Problem of Old Evidence. How do Bayesians describe 
the confirmatory power of the discovery that a theory T implies evidence 
E when E has been known for a long time? According to the standard 
Bayesian model of confirmation, evidence E confirms theory T if and only 
if learning E raises the probability of T. But this is impossible if the evi- 
dence is already known (p(E) = 1). We resolve this problem by means of 
two different Bayesian models that demonstrate how explaining old evi- 
dence raises the rational degree of belief in theory T. 


Variation 4 deals with the No Alternatives Argument. Does the failure 
to find alternatives to a scientific theory confirm it? Arguments of this kind 
are often employed in support of string theory or other theories that lack 
strong empirical support (e.g., in paleontology). After all, there have been 
enormous efforts to find a viable alternative, and the failure to find one 
may license an explanatory inference for the truth (or empirical adequacy) 
of that theory. By framing the argument within a probabilistic model, we 
can show that longstanding failure to find alternatives supports a theory— 
even if the strength of the argument (i.e., the degree of confirmation it 
provides) is context-sensitive and depends on the exact circumstances. 


Variation 5 develops a probabilistic assessment of the famous No Mira- 
cles Argument in favor of scientific realism. That argument contends, in a 
nutshell, that the truth of scientific theories is the only viable explanation 
of their success. We frame the No Miracles Argument as a confirmatory 
argument: does the success of scientific theories make them more proba- 
ble? We set up various Bayesian models to answer this question, which 
also correspond to different ways of interpreting the No Miracles Argu- 
ment. These models take into account factors that have been neglected in 
reconstructions of the No Miracles Argument: the stability of theories in 
a specific discipline, and their success rate. Thus we get a better grip on 
the circumstances when the success of science supports realist inclinations, 
and when it doesn’t. 

The second set of variations abandons the topic of confirmation in favor 
of central concepts in scientific reasoning. Some of them (e.g., explanatory 
power, simplicity, corroboration, intertheoretic reduction) are also often 
cited as virtues of a good scientific theory.

Variation 6 develops a Bayesian analysis of causal effect, building on
the scientific literature on causal Bayes nets and combining it with meth- 
ods from Bayesian confirmation theory. First, we defend the choice of 
a framework where different measures can be embedded and compared: 
causal Bayes nets. Second, we derive representation theorems for various 
measures of causal effect, that is, theorems that characterize a measure of 
causal effect as the only measure (up to ordinal equivalence) that satisfies 
a certain set of adequacy conditions. Third, we make an argument for 
preferring a particular measure. Finally, we apply that measure to a case 
from epidemiology, demonstrating how closely scientific and philosophi- 
cal reasoning about causal effect are intertwined. 


Variation 7 is devoted to the topic of explanatory power, a classical cog- 
nitive value in scientific reasoning. Hempel (1965) famously postulated a 
structural identity between prediction and explanation: explanations show 
why a particular phenomenon occurred by deriving it from the theory, and 
explanatory power is proportional to the ability of the explanans to ac- 
count for the explanandum (see also Hempel and Oppenheim, 1948). We 
explore to what extent this classical view, fallen out of fashion in modern 
philosophy of science, can be rescued in a Bayesian framework, where ex- 
plications of explanatory power are based on considerations of statistical 
relevance. Then we compare several of these explanatory power measures
and their respective strengths and weaknesses.


Variation 8 provides a Bayesian account of intertheoretic reduction. 
Ceteris paribus, theories which have broad scope and cohere with theo- 
ries at other levels of description have more value than isolated theories. 
For example, models of statistical mechanics reduce to thermodynamics 
equations in the mathematical limit. Such reductive relationships between 
theories at the phenomenal and fundamental level are described by the 
models by Nagel (1961) and Schaffner (1967). We defend these models 
against the standard criticism and show how the establishment of reduc- 
tive relationships can raise the probability of the involved theories. That is, 
we do not only show how reduction unifies different theories, but we also 
demonstrate that it has a positive effect on the assessment of the involved 
theories. 

Variation 9 brings together Bayesian models of theory assessment with 
(frequentist) hypothesis testing in science and Popper’s critical rationalism.
In hypothesis tests, we often observe a failure to reject the null (=default)
hypothesis at a statistically significant level. Does this mean that the
null hypothesis is confirmed, or as Popper said, corroborated by the re- 
sults? Can we characterize the conditions when corroboration takes place, 
and give a quantitative dimension to corroboration judgments? We first 
show why a confirmation-theoretic framework cannot provide such an ex- 
plication. Then we derive an axiomatic measure of corroboration from a 
more general set of probabilistic constraints and relate it back to principles 
of Bayesian inference. 

Variation 10 analyzes the value of simplicity in statistical inference. 
We analyze the general question of whether simplicity is a good reason to 
prefer a theory to an alternative in the context of statistical model selection. 
In particular, we are concerned with Forster and Sober’s (1994) thesis that 
simpler models are more likely to be true, or at least predictively accurate. 
We analyze model selection criteria that could support their claim, such 
as the Akaike Information Criterion (AIC) and the Bayesian Information 
Criterion (BIC) and compare them to genuinely Bayesian methods, such as 
model selection on the basis of Bayes factors. We demonstrate that a link 
between simplicity and predictive success cannot be established on the 
basis of these criteria. Furthermore, we show that Bayesian methods are 
often used in an instrumental way that is detached from the philosophical 
foundations of Bayesian inference. 

Variation 11, finally, deals with the question of whether subjective 
Bayesian inference can ever achieve a sufficient degree of objectivity to 
counter the charge of arbitrariness, and to maintain the epistemic author- 
ity of science. In particular, the irreducibly subjective nature of prior prob- 
abilities, and their inevitable impact on (supposedly objective) measures of 
evidence is often cited as a reason to mistrust Bayesian inference. However, 
such arguments often presuppose an unrealistically strong and outdated 
conception of scientific objectivity. Therefore, our strategy for countering 
this criticism is twofold: we combine an up-to-date conceptual analysis of 
scientific objectivity with formal arguments that Bayesian inference is, on 
these accounts, no less objective than its competitors. 

The book concludes with a short recapitulation of the original theme: 
we count the successes and failures of Bayesian philosophy of science, 
draw up the balance, and sketch future research projects.


Variation 1: Learning Conditional Information


Indicative conditionals of the form “if A, then B” constitute a substantial 
part of scientific evidence. Many experiments report that a certain inter- 
vention leads to a certain effect. If sugar is thrown into a glass of water, it 
dissolves. If a mouse is infected with a certain virus, it develops a certain 
type of symptoms. If a consumer is frequently exposed to a commercial, 
he or she may be more likely to buy the advertised product.


Scientific evidence thus often comes in the form of conditional state- 
ments. But how should we change our degrees of belief in scientific theo- 
ries when we learn such conditionals? This question has prompted a large 
amount of literature, but without conclusive results. Douven (2012) con- 
cludes that a general and feasible account of learning conditionals is still 
to be formulated. Indeed, all accounts that have been proposed so far face 
problems. Here are three popular attempts. 


First and most straightforwardly, one might identify the natural language
indicative conditional H → E with the material conditional H ⊃ E,
which is equivalent to ¬H ∨ E. Then, one may conditionalize on that
proposition. Popper and Miller (1983) challenged this proposal with an
argument based on the probability calculus. It goes as follows. Consider two
propositions H and E and a probability distribution p with 0 < p(H) < 1
and p(E|H) < 1. We now learn the indicative conditional H → E, which
we express as the material conditional ¬H ∨ E. To update our beliefs, we
use Bayesian Conditionalization, i.e., we calculate the posterior probability
p’(H) := p(H|¬H ∨ E). Interestingly, it turns out that p’(H) < p(H). The
proof is elementary; so we reproduce it here. Bayes’ Theorem implies that


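\[
p(H \mid \neg H \vee E) = p(\neg H \vee E \mid H) \, \frac{p(H)}{p(\neg H \vee E)}
\]

and hence, it is sufficient to show that p(¬H ∨ E|H) < p(¬H ∨ E):

\[
\begin{aligned}
p(\neg H \vee E \mid H) - p(\neg H \vee E)
  &= p(\neg H \mid H) + p(E \mid H) - p(\neg H \wedge E \mid H) - \bigl(1 - p(H) + p(H, E)\bigr) \\
  &= p(E \mid H) - \bigl(1 - p(H)\bigr) - p(E \mid H)\, p(H) \\
  &= p(E \mid H)\,\bigl(1 - p(H)\bigr) - \bigl(1 - p(H)\bigr) \\
  &= \bigl(1 - p(H)\bigr)\,\bigl(p(E \mid H) - 1\bigr) \\
  &< 0.
\end{aligned}
\]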


In other words, learning “if H, then E” always decreases the probability of
H if one interprets the above sentence as the material conditional. However,
there are also cases where the posterior probability of a hypothesis is
intuitively judged to be greater than or equal to the prior probability upon
learning that it makes a particular prediction. We give some examples in
Section 1.2. The naive Bayesian proposal to identify learning an indicative
conditional with learning the associated material conditional has trouble
accounting for such evidence: it does not do justice to the variety of
conditional information that we encounter in nature.
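
The Popper and Miller inequality can also be checked numerically. The following Python sketch is an illustration only and not part of the original argument; the function name and the grid of test values are assumptions made for this example. It computes p(H|¬H ∨ E) from p(H) and p(E|H) and confirms on the grid that the posterior stays below the prior.

```python
def posterior_material_conditional(p_h, p_e_given_h):
    """Return p(H | not-H or E), obtained by Bayesian Conditionalization."""
    # p(not-H or E) = p(not-H) + p(H and E)
    p_material = (1.0 - p_h) + p_h * p_e_given_h
    # p(H and (not-H or E)) = p(H and E)
    return p_h * p_e_given_h / p_material


# Hypothetical test values with 0 < p(H) < 1 and p(E|H) < 1.
grid = [i / 10 for i in range(1, 10)]
for p_h in grid:
    for p_e_given_h in grid:
        posterior = posterior_material_conditional(p_h, p_e_given_h)
        # The posterior of H is strictly smaller than its prior.
        assert posterior < p_h

print(posterior_material_conditional(0.5, 0.8))
```

Note that p(E|¬H) plays no role in the computation: only the prior of H and the likelihood p(E|H) determine how far the probability of H drops.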

Second, David Lewis (1976) proposed an account called imaging, which 
requires a possible worlds semantics with similarity relations between dif- 
ferent possible worlds. On Lewis’ account, an indicative conditional is 
true if the consequens holds true in the closest possible world where the 
antecedens is true. It turns out, however, that this proposal also fails to 
do justice to some of our intuitive judgments (cf. Douven and Dietz, 2011). 
Moreover, Lewis’ imaging is an account of the semantics of conditionals; 
it is unclear how it should guide our reasoning with conditionals and the 
revision of our beliefs. 

Third, one may conclude that Bayesian Conditionalization is, as an up- 
dating rule, too restricted to account for learning indicative conditionals, 
or more generally, conditional information. Does this mean farewell to 
Bayesian philosophy of science? Not necessarily so. We may try to find 
a conservative extension of Bayesian Conditionalization: an updating rule 
which preserves the probabilistic nature of degrees of belief, which agrees 
with Bayesian Conditionalization for learning first-order propositions, and 
which also covers the learning of conditionals. This is our approach in 
this variation. We propose minimizing Kullback-Leibler divergence
between the posterior and the prior probability distribution as an extension
of Bayesian Conditionalization that is able to account for the persistent
problem of learning conditional information (e.g., indicative condition- 
als). More precisely, we argue that minimizing divergence between prior 
and posterior probability distribution delivers intuitively correct results 
for learning conditional information if the posterior distribution is also 
required to respect causal and inferential constraints provided by the 
context of the examples in question. 

The remainder of this variation is organized as follows. Section 1.1 
introduces KL divergence minimization and shows its equivalence to 
Bayesian Conditionalization for updating on standard evidence (i.e., learn- 
ing a first-order proposition E). Section 1.2 challenges this account by 
developing three examples which the divergence minimization method 
struggles to model adequately. Section 1.3 shows how these challenges 
can be met if the divergence minimization method is extended and prop- 
erly applied. The entire variation, and this section in particular, builds on 
the theoretical innovations presented in Hartmann and Rafiee Rad (2016) 
and applies them to scientific reasoning. Finally, Section 1.4 takes stock 
and comments on the scope of our proposal while Section 1.5 contains 
the proofs of our results. From now on, we drop the adjective “indicative”;
the noun “conditional” is always meant to refer to an indicative
conditional.


The Kullback-Leibler Divergence and Probabilistic Updating


Bayesian Conditionalization is a powerful, but somewhat limited tool for 
changing one’s belief in the light of new evidence. As we have motivated 
at the beginning of this variation, not all scientific evidence comes in the 
form of a first-order proposition that we can easily learn and represent in 
our object language. Indicative conditionals, whose propositional status 
is very much disputed in the literature (e.g., Edgington, 1995, 2014), are a 
case in point. Other examples are learning the mean value of a random 
variable, learning measurement variance, etc. It is far from clear how con- 
ditionalization on sentences such as “the mean of variable X is between 
13.2 and 15.4” should work. This calls for a more general updating rule 
where these sentences, instead of being the objects of conditionalization, 
constrain probability functions that represent our rational degrees of belief.

The Kullback-Leibler (KL) divergence D_KL(p’||p) plays such a constraining
role. It was introduced in the context of transmitting electric
signals using a binary code (Shannon, 1949). In the next two paragraphs, 
we motivate the particular mathematical form of KL divergence. Notably, 
the results in this variation also hold for divergence functions that differ 
from KL divergence, e.g., Hellinger distance. Hence, nothing substantial 
depends on the choice of this particular measure. 

Kullback-Leibler divergence is relevant in the context of finding a
cost-efficient code for transmitting a string S whose tokens (e.g., letters
of the alphabet) occur with varying frequency, expressed by probability
distribution p’. The fewer bits we need to transmit the string, the better.
Assume that we use a binary code C where frequently occurring letters
such as “e” and “a” are coded by short sequences of bits such as “0” and
“10”, and infrequently occurring letters such as “x” are coded by long
sequences such as “11111110”. The length of the bit sequences implicitly
defines a probability distribution p over the elements of our code: namely
that distribution of tokens which is optimally encoded by C. In the above
example, where a “0” denotes transition to the next token, that would be
p(“e”) = 1/2, p(“a”) = 1/4, p(“x”) = 1/256. You get the idea. KL divergence
measures the loss in efficiency by using a code with probability distribution
p, when the real frequency of the tokens in S follows distribution
p’. This loss is expressed by the expected excess length of transmitting S
by our code C, instead of using an optimal code.

Assume that s_1, ..., s_n denote the tokens of the string S which we try
to transmit. How should we measure the loss in efficiency when transmitting
S with code p although the true frequency distribution of the s_i is
described by p’? The Kullback-Leibler divergence calculates this difference
as follows:


\[
D_{KL}(p' \,\|\, p) := \sum_{i=1}^{n} p'(s_i) \, \log \frac{p'(s_i)}{p(s_i)} \tag{1.1}
\]


In this formula, we quantify the loss in efficiency for each token s_i by
means of the logarithmic difference log p’(s_i) − log p(s_i), which is
equivalent to log(p’(s_i)/p(s_i)), and we use the actual frequency of each
token for weighting these losses. If the logarithm is taken to base two, this
means that D_KL(p’||p) expresses the expected number of extra bits per token
that we will have to invest for transmitting S with p instead of the optimal
code p’.

Kullback-Leibler divergence can be generalized beyond the
information-theoretic context and be regarded as a general expression
of divergence between two probability distributions. Note that the
base of the logarithm will not matter for the results that we show in this
variation. Moreover, D_KL(p’||p) will be zero if and only if p’ and p agree
on all elements of the state space.
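
To make the coding interpretation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions: the three-letter prefix code, the "true" frequencies and the function names are hypothetical and not taken from the text. The sketch derives the distribution that a code implicitly defines via its codeword lengths and then evaluates Equation (1.1) with base-two logarithms.

```python
import math


def implied_distribution(code):
    """Probability 2**(-length) that each codeword length implicitly assigns."""
    return {token: 2.0 ** -len(word) for token, word in code.items()}


def kl_divergence(p_true, p_code):
    """D_KL(p_true || p_code) in bits; both arguments map tokens to probabilities.

    Assumes p_code is positive wherever p_true is positive.
    """
    total = 0.0
    for token, p_t in p_true.items():
        if p_t > 0.0:  # convention: tokens with zero true frequency contribute 0
            total += p_t * math.log2(p_t / p_code[token])
    return total


# Hypothetical prefix code over a three-letter alphabet.
code = {"e": "0", "a": "10", "x": "11"}
p = implied_distribution(code)              # e: 1/2, a: 1/4, x: 1/4
p_prime = {"e": 0.25, "a": 0.25, "x": 0.5}  # assumed true token frequencies

excess_bits = kl_divergence(p_prime, p)
print(f"expected excess bits per token: {excess_bits:.3f}")
print(f"divergence of p' from itself:   {kl_divergence(p_prime, p_prime):.3f}")
```

With these assumed numbers the divergence is 0.25 bits per token. This matches a direct count: the code above needs 0.25·1 + 0.25·2 + 0.5·2 = 1.75 bits per token on average, whereas a code tailored to p’ (one bit for “x”, two bits each for “e” and “a”) needs 1.5.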

The Kullback-Leibler divergence implicitly defines a method for updating
degrees of belief: when learning a set of constraints 𝒞, one should
adopt the probability distribution p’(·) which has, among all distributions
that satisfy 𝒞, the smallest KL divergence to the old distribution p(·). In
other words, we change beliefs as much as required by the learned set of
empirical constraints, but no more: like theories of belief revision such
as AGM (Alchourrón et al., 1985; Makinson, 1985), minimizing Kullback-Leibler
divergence is inherently conservative. Moreover, the divergence
minimization method can also handle much more general constraints than
Bayesian Conditionalization. Yet, the two forms of updating are closely related
and often equivalent, as we shall show in this section. This result was
first proved by Diaconis and Zabell (1982).

Consider two binary propositional variables, H and E. We represent
the probabilistic dependence between H and E in the Bayesian Network
depicted in Figure 1.1. To complete it, we fix the prior probability of the
root node H, i.e.

\[
h := p(H) \tag{1.2}
\]

and the conditional probabilities of E, given the values of its parent H:

\[
p := p(E \mid H), \qquad q := p(E \mid \neg H) \tag{1.3}
\]


[Figure 1.1: The Bayesian Network representation of the relation between
H and E.]

Next, we learn that the evidence E obtains. This is a constraint on the
posterior probability distribution p’ which amounts to

\[
p'(E) = 1. \tag{1.4}
\]
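Now we make analogous definitions for the variables h’, p’ and q’:

\[
h' := p'(H), \qquad p' := p'(E \mid H), \qquad q' := p'(E \mid \neg H)
\]

Calculating the Kullback-Leibler divergence between p’ and p yields

\[
\begin{aligned}
D_{KL}(p' \,\|\, p) &= \sum_{H,E} p'(H,E) \, \log \frac{p'(H,E)}{p(H,E)} \\
  &= h' \log\!\left(\frac{h'}{h\,p}\right) + \bar{h}' \log\!\left(\frac{\bar{h}'}{\bar{h}\,q}\right) \\
  &= h' \log\frac{h'}{h} + \bar{h}' \log\frac{\bar{h}'}{\bar{h}} + h' \log\frac{1}{p} + \bar{h}' \log\frac{1}{q}
\end{aligned}
\tag{1.5}
\]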


where we have used the convenient shorthand h̄ := 1 − h, which we
will use throughout the book. We have also used the shorthand notation
p(H,E) for p(H ∧ E), which we will also use below when appropriate.
With the help of Equation (1.5), we can show the following theorem:


Theorem 1.1 (Diaconis and Zabell, 1982) Let p(·) be a probability distribution
over propositions of a propositional language L. Suppose we learn sentence
E. Then the following two updating rules for the posterior distribution p’ are
equivalent:


1. Bayesian Conditionalization on E: p’(·) = p(·|E).


2. Minimizing Kullback-Leibler divergence D_KL(p’||p) subject to the
constraint that p’(E) = 1.


This result, although far from being novel, is deep and interesting. It 
shows that Bayesian Conditionalization on proposition E is the updating 
method which minimizes the change from the current probability dis- 
tribution if (i) the posterior probability distribution p’ is constrained by 
p'(E) = 1; (ii) the amount of change is measured by Kullback-Leibler di- 
vergence. This result can be interpreted as saying that Bayesian Condition- 
alization does not require us to change our beliefs more than necessary. 
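
The equivalence stated in Theorem 1.1 can be illustrated numerically for the two-variable network. The Python sketch below is a brute-force check under assumed prior values, not a proof; the function name and the numbers for p(H), p(E|H) and p(E|¬H) are hypothetical. Since the constraint p’(E) = 1 leaves only p’(H) free, a one-dimensional grid search suffices.

```python
import math


def kl_to_prior(h_post, h, p, q):
    """D_KL(p'||p) for a posterior with p'(E) = 1 and p'(H) = h_post.

    Only the atoms (H, E) and (not-H, E) carry posterior mass.
    """
    return (h_post * math.log(h_post / (h * p))
            + (1 - h_post) * math.log((1 - h_post) / ((1 - h) * q)))


# Hypothetical prior: p(H), p(E|H), p(E|not-H).
h, p, q = 0.3, 0.8, 0.2

# Brute-force search over candidate values of p'(H) on a fine grid.
grid = [i / 10000 for i in range(1, 10000)]
h_min = min(grid, key=lambda x: kl_to_prior(x, h, p, q))

# Bayesian Conditionalization: p(H|E) by Bayes' Theorem.
h_bayes = h * p / (h * p + (1 - h) * q)

print(f"KL minimizer on the grid: {h_min:.4f}")
print(f"p(H|E) by conditionalization: {h_bayes:.4f}")
```

Up to the resolution of the grid, the minimizer of the divergence coincides with the conditional probability p(H|E).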
Let us now explore whether this method can also be applied to finding 
a suitable probability distribution after having learned a conditional. To 
apply the proposed method, one has to derive a probabilistic statement 
from the learned conditional. Here we follow Douven (2012) and others
and represent learning the conditional H → E as a constraint on the
conditional posterior probability p’(E|H). In particular, we assume that
learning H → E implies p’(E|H) = 1. That is, we are not interested in
the probability of the conditional itself. Instead, we are interested in the
effects of learning a conditional on the relevant conditional probabilities. This 
assumption is compatible with agnosticism about the propositional status 
of conditionals. It is also weaker than Stalnaker’s Thesis which identifies 
the probability of a conditional with its conditional probability (Stalnaker, 
1968, 1970, 1975), and which is vulnerable to triviality arguments in the 
style of Lewis (1976). Our approach can also be motivated from the Ram- 
sey test for conditional probabilities, which evaluates them by hypothet- 
ically adding the antecedens to the background knowledge—see page 11 
in the introduction. In other words, if we already know H → E and add
H to our stock of background knowledge, E will be a certainty by Modus 
Ponens. Hence, p(E|H) = 1. 

To test this method, consider the Bayesian Network depicted in Figure
1.1. In this scenario, we learn the conditional H → E, which implies that

\[
p'(E \mid H) = 1. \tag{1.6}
\]

The Kullback-Leibler divergence between p’ and p is then given by


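\[
\begin{aligned}
D_{KL}(p' \,\|\, p) &= \sum_{H,E} p'(H,E) \, \log \frac{p'(H,E)}{p(H,E)} \\
  &= h' \log\!\left(\frac{h'}{h\,p}\right) + \bar{h}' \left( q' \log\!\left(\frac{\bar{h}'\,q'}{\bar{h}\,q}\right) + \bar{q}' \log\!\left(\frac{\bar{h}'\,\bar{q}'}{\bar{h}\,\bar{q}}\right) \right)
\end{aligned}
\tag{1.7}
\]

A simple algebraic proof then suffices for the following theorem: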


Theorem 1.2 Let E, H be two sentences of a propositional language L with
probability distribution p(·). Suppose we learn H → E and we construct the posterior
probability distribution by minimizing Kullback-Leibler divergence D_KL(p’||p)
in p’, subject to the constraint that p’(E|H) = 1. Then, if p(E|H) < 1, it will be
the case that p’(H) < p(H).


This result may sound wrong at first sight. After all, we only learn that 
H has E as a consequence and nothing else. So why should this prompt us 
to change our belief in H? And why should the probability of H decrease? 
Note, however, that H becomes more informative after having learned the 
conditional. It makes a prediction on E. It is therefore natural to set the 
new probability of H to a lower value as more informative hypotheses run
a greater risk of being mistaken. This point was already made by Popper (2002)
and Hempel and Oppenheim (1948). They stressed that being informa- 
tive is a scientific virtue which may contribute to the acceptance of H, but 
it is not correlated with posterior probability, quite to the contrary. The 
result of Theorem 1.2 agrees, by the way, with Popper and Miller’s diagnosis
for learning the material conditional H ⊃ E by means of Bayesian
Conditionalization (Popper and Miller, 1983).
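
Theorem 1.2 and its agreement with the Popper and Miller result can be illustrated with a small numerical experiment. The Python sketch below is an illustration under assumed prior values, not the proof referred to in the text; the function name, the grid resolution and the numbers are hypothetical. It minimizes the divergence over the two free posterior parameters p’(H) and p’(E|¬H) while holding p’(E|H) = 1 fixed, and compares the minimizer with p(H|¬H ∨ E) = hp/(hp + 1 − h).

```python
import math


def kl_to_prior(h_post, q_post, h, p, q):
    """D_KL(p'||p) for a posterior with p'(E|H) = 1.

    h_post is p'(H) and q_post is p'(E|not-H); the atom (H, not-E) has
    posterior mass zero and therefore drops out of the sum.
    """
    hb, hb_post = 1 - h, 1 - h_post
    return (h_post * math.log(h_post / (h * p))
            + hb_post * q_post * math.log(hb_post * q_post / (hb * q))
            + hb_post * (1 - q_post)
            * math.log(hb_post * (1 - q_post) / (hb * (1 - q))))


# Hypothetical prior: p(H), p(E|H) < 1, p(E|not-H).
h, p, q = 0.5, 0.6, 0.3

# Brute-force search over both free parameters on a grid.
steps = [i / 200 for i in range(1, 200)]
h_min, q_min = min(((x, y) for x in steps for y in steps),
                   key=lambda xy: kl_to_prior(xy[0], xy[1], h, p, q))

# Conditionalizing on the material conditional: p(H | not-H or E).
h_material = h * p / (h * p + (1 - h))

print(f"KL minimizer: p'(H) = {h_min:.3f}, p'(E|not-H) = {q_min:.3f}")
print(f"p(H | not-H or E) = {h_material:.3f}, prior p(H) = {h:.3f}")
```

In this example the minimizing p’(H) lies below the prior p(H) and coincides, up to grid resolution, with the posterior obtained by conditionalizing on the material conditional, while p’(E|¬H) stays at its prior value q.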

We have seen that minimizing Kullback-Leibler divergence leads to 
reasonable results for situations involving two propositional variables. But 
does it also work for more complicated scenarios? 


Three Challenges for Minimizing Divergence 


In a variety of papers, Richard Dietz, Igor Douven and Jan-Willem Romeijn 
developed examples which challenge the divergence minimization method 
for learning conditionals. Below, we adapt those examples to a context of 
scientific reasoning. Each example starts with a story that sets up the 
scene. Then a conditional is learned which may prompt some previously 
held beliefs to change. 


1. The Medicine Example. A general practitioner has to choose whether 
or not to give drug D to a patient. She will administer D if and only 
if (i) D is effective against the strains of bacteria that the patient is in- 
fected with; (ii) the patient has no medical condition that makes him 
susceptible to serious side effects of D. The GP’s assistant checks the
patient’s medical history and tells her boss: “If D is effective against 
that strain of bacteria, then we should administer D.” Upon learning 
this conditional, the GP does not change her belief in the efficacy 
of D—rather, she reasons that her assistant has checked whether the 
patient is sensitive to side effects of D. Thus learning the conditional 
leaves the probability of the antecedens unchanged. This counterex- 
ample to Theorem 1.2 is adapted from Douven and Romeijn (2011). 


2. The Astronomy Example. The astronomical community in the 16th century
considered two general theories, the Copernican model and the
Ptolemaic model. An astronomer observes the movements of the 
planets Mars, Jupiter and Saturn over an extended period and notes 
down his observations. He finds an agreement between periods of
retrograde motion and relative brightness. He now works
through the implications of the Copernican model and he realizes:
“If the Copernican model is true, then the outer planets (=Mars,
Jupiter and Saturn) will display retrograde motion when they are 
close to Earth.” Already knowing that the (apparent) retrograde mo- 
tion of these planets would agree with his actual observations be- 
cause brightness is an indicator of spatial proximity, he now finds 
it more likely that the Copernican model is true. In this example, 
learning the conditional information should intuitively increase the 
probability of the antecedens of the conditional. This example is 
adapted from Douven and Dietz (2011). 


3. The Economics Example. An economist is interested in whether a 
country is recovering economically. During the Christmas period, 
she surveys the sales volume of several warehouses. It turns out to 
be low. She asks a colleague about the consequences of economic re- 
covery on consumer income. Her colleague answers: “If there is an 
economic recovery going on, consumers’ income has increased”, e.g.,
because of generous end-of-the-year bonuses. Upon learning this 
conditional, the economist thinks it is doubtful (even if not wholly 
excluded) whether an economic recovery is currently going on. As 
a result, she lowers her degree of belief for economic recovery and 
thus decreases the probability of the antecedens of the conditional. 
This example is adapted from Douven (2012). 


These three cases describe different ways in which learning a conditional
may affect the probability of the antecedens: it may be lowered, increased,
or remain unchanged. This is bad news for the divergence minimization
method which claims that the probability of the antecedens always
decreases (see Theorem 1.2). Does this mean that the project of finding a
general theory of learning conditionals is futile or doomed? We disagree 
and show how the divergence minimization method can successfully deal 
with the above examples when plausible causal and inferential constraints 
are imposed on the posterior probability distribution.

Meeting the Challenges


To meet the three challenges presented in the previous section, we adopt 
the following methodology. First, we identify all relevant variables of the 
problem at hand and the causal and inferential relations that hold between 
them. Second, we represent causal and inferential relations between the 
variables by (conditional) independencies in a Bayesian Network and fix 
the prior probability distribution p that is associated with that network. 
Third, we express the learned conditional as a constraint on the posterior 
probability distribution p’ and assume that the relevant independencies are 
not changed by learning the conditional. That is, they are constraints on the 
prior and the posterior probability distribution, e.g., because they express 
a certain causal structure. From the story it is clear that the incoming infor- 
mation does not overturn the structure; hence, we should preserve it as a 
constraint on the posterior probability distribution. Fourth, we obtain the 
posterior probability distribution p’ by minimizing the Kullback-Leibler 
divergence D_KL(p’||p) to the prior distribution p. Fifth, we check whether
the results comply with our intuitions. To repeat, in comparison to standard
updating by minimizing Kullback-Leibler divergence, we have now
imposed the additional constraint on the posterior distribution that causal
structure (e.g., which interventions affect which variables) and inferential
relations (e.g., which variables are probabilistically independent of others)
remain intact.


The Medicine Example 


We introduce three binary propositional variables to represent the 
medicine example. The variable E has the values E: “Drug D is effective
for the bacteria the patient is infected with”, and ¬E: “Drug D is not
effective for these bacteria”. The variable S has values S: “The patient is
susceptible to serious side effects when taking drug D”, and ¬S: “The patient
is not susceptible to serious side effects when taking drug D”. Finally,
the variable A has the values A: “Administer drug D”, and ¬A: “Do not
administer drug D”.

Before we proceed, let us show that using the material conditional and
Bayesian Conditionalization leads to an intuitively wrong result. To do
so, remember that the learned conditional is “if drug D is effective against
these bacteria, then administer it”, expressed as E → A. We interpret this
as the material conditional E ⊃ A, which is equivalent to ¬E ∨ A. Assuming
0 < p(E), p(E,¬A) < 1 and using the ratio analysis of conditional probability,
we then obtain

\[
p'(E) = p(E \mid E \supset A)
      = \frac{p(E \wedge (\neg E \vee A))}{p(\neg E \vee A)}
      = \frac{p(E \wedge A)}{p(\neg E \vee A)}
      = \frac{p(E) - p(E, \neg A)}{1 - p(E, \neg A)}. \tag{1.8}
\]

It is then easy to verify that Equation (1.8) requires p'(E) = p(E|E ⊃ A) <
p(E), which conflicts with our intuitive judgment that the probability of
the efficacy of the drug should remain unchanged.
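
The inequality can be checked numerically. The following minimal Python sketch evaluates the right-hand side of Equation (1.8); the function name and the input values are illustrative assumptions, not part of the original example.

```python
def posterior_material_conditional(p_e: float, p_e_not_a: float) -> float:
    """Return p(E | not-E or A), computed as in Equation (1.8).

    p_e       -- prior probability p(E)
    p_e_not_a -- prior joint probability p(E, not-A)
    """
    if not 0.0 < p_e < 1.0:
        raise ValueError("p(E) must lie strictly between 0 and 1")
    if not 0.0 < p_e_not_a <= p_e:
        raise ValueError("p(E, not-A) must be positive and at most p(E)")
    return (p_e - p_e_not_a) / (1.0 - p_e_not_a)


# Illustrative (made-up) numbers: p(E) = 0.6 and p(E, not-A) = 0.2.
prior_e = 0.6
posterior_e = posterior_material_conditional(prior_e, 0.2)
print(f"prior p(E) = {prior_e:.3f}, posterior p'(E) = {posterior_e:.3f}")
```

For any admissible inputs the returned value lies below p(E), which is the counterintuitive drop in the probability of the drug's efficacy.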

Let us now show how our suggested methodology deals with the case.
The story suggests a number of dependencies and independencies between
the various variables. The Bayesian Network in Figure 1.2 represents
the probabilistic dependencies and independencies between these
variables. The arrows represent the causal relations and the effect of
interventions on the variables.


Figure 1.2: The Bayesian Network for the Medicine Example (root nodes E and S, each with an arrow into the node A).


To complete the Bayesian Network, we have to fix the prior probability
of the root nodes and the conditional probabilities of all other nodes, given
the values of their parents. We set

\[
e := p(E), \qquad s := p(S)
\]

and

\[
\begin{aligned}
\alpha &:= p(A \mid E, S), & \beta &:= p(A \mid E, \neg S), \\
\gamma &:= p(A \mid \neg E, S), & \delta &:= p(A \mid \neg E, \neg S).
\end{aligned}
\]

Given the punch line of the story, we may assume that

\[
\beta = p(A \mid E, \neg S) = 1. \tag{1.9}
\]

That is, if the drug is effective and the patient is not susceptible to side
effects, we will administer the drug. All other conditional probabilities (i.e.,
$\alpha$, $\gamma$ and $\delta$) are in the open interval (0,1). Let us now consider the posterior
probability distribution p', which is defined over the same variables as the
prior distribution. The constraint $\beta = 1$, due to Equation (1.9), should be
preserved in that distribution, too. Hence we conclude


\[
\beta' := p'(A \mid E, \neg S) = 1, \qquad \bar{\beta}' := p'(\neg A \mid E, \neg S) = 0. \tag{1.10}
\]


Another constraint on the posterior probability distribution is the learned 
conditional “if the drug is effective against these bacteria, then administer 
it”, which implies that 

\[
p'(A \mid E) = 1 \tag{1.11}
\]

and hence p'(¬A|E) = 0. Assuming that all unconditional probabilities
are in the open interval (0,1), we can apply the ratio analysis of conditional
probability and infer that p'(¬A,E) = 0, which implies in turn that

\[
\begin{aligned}
p'(\neg A, E, \neg S) &= p'(\neg A \mid E, \neg S)\, p'(E)\, p'(\neg S) = 0, \\
p'(\neg A, E, S) &= p'(\neg A \mid E, S)\, p'(E)\, p'(S) = 0.
\end{aligned}
\]

The first equation is satisfied since $\bar{\beta}' = p'(\neg A \mid E, \neg S) = 0$, as we noted in
Equation (1.10). Regarding the second equation, one of the factors on the
right hand side must be zero. We can safely assume that p'(E) > 0 (why
should learning the conditional E → A rule out that the drug is effective?)
and also that $\bar{\alpha}' = p'(\neg A \mid E, S) > 0$: if the patient is susceptible to side
effects, it is not clear whether the GP still administers drug D. Hence s' :=
p'(S) = 0. This makes sense: the information received from the assistant
suggests that the patient is not susceptible to serious side effects. We can
now show the following theorem:


Theorem 1.3 Consider the Bayesian Network in Figure 1.2 with the prior
probability distribution p from Equation (1.41). We furthermore assume that


(i) the posterior probability distribution p' is defined over the same Bayesian
Network and respects the same independence assumptions;


(ii) the learned conditional is modeled as the constraint (1.11) on p', that is,
p'(A|E) = 1, implying p'(S) = 0;

(iii) p' minimizes the Kullback-Leibler divergence to p.


Then p'(E) = p(E).

We conclude that the proposed method yields the intuitively correct
result in this case: the probability of the drug being effective is invariant
under learning the conditional E → A.
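
The claim of Theorem 1.3 can be illustrated numerically. The sketch below is a brute-force check under stated assumptions, not the proof: it takes for granted the constraints derived in the text (β = 1, p'(A|E) = 1 and p'(S) = 0), so that only three atoms keep positive posterior mass, and it searches a grid over the two free posterior parameters e' = p'(E) and δ' = p'(A|¬E,¬S). The function names and the prior values are illustrative assumptions.

```python
import math


def kl_medicine(e_post, d_post, e, s, delta):
    """D_KL(p'||p) for a posterior with p'(S) = 0 and p'(A|E) = 1."""
    # Pairs (posterior atom, prior atom) for the three atoms with p' > 0.
    atoms = [
        (e_post, e * (1 - s)),                                           # A, E, not-S (beta = 1)
        ((1 - e_post) * d_post, (1 - e) * (1 - s) * delta),              # A, not-E, not-S
        ((1 - e_post) * (1 - d_post), (1 - e) * (1 - s) * (1 - delta)),  # not-A, not-E, not-S
    ]
    return sum(post * math.log(post / prior) for post, prior in atoms)


def minimize_on_grid(e, s, delta, steps=200):
    """Brute-force search over (e', delta') on an open grid in (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    _, best_e, best_d = min(
        (kl_medicine(x, y, e, s, delta), x, y) for x in grid for y in grid
    )
    return best_e, best_d


# Illustrative (made-up) prior parameters: e = p(E), s = p(S), delta = p(A|not-E, not-S).
e, s, delta = 0.3, 0.4, 0.25
best_e, best_d = minimize_on_grid(e, s, delta)
print(f"p(E) = {e}, minimizing p'(E) = {best_e}, minimizing delta' = {best_d}")
```

With these inputs the grid contains the prior values of e and δ, and the search returns them as the minimizer, in line with p'(E) = p(E).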


The Astronomy Example 


Again, we introduce three binary propositional variables to represent the
astronomy example. The variable C has the values C: “The Copernican model
is true”, and ¬C: “The Copernican model is false”. The variable M has
the values M: “The outer planets display retrograde motion when close
to Earth”, and ¬M: “The outer planets do not display retrograde motion
when close to Earth”. Finally, the variable O has the values O: “Periods
of retrograde motion and relative brightness agree”, and ¬O: “Periods of
retrograde motion and relative brightness do not agree”. The Bayesian
Network in Figure 1.3 represents the probabilistic dependencies and
independencies between these variables: O depends on C only via M, or in
other words: O ⊥⊥ C | M.


Figure 1.3: The Bayesian Network for the Astronomy Example (a chain C → M → O).


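To complete the Bayesian Network, we have to fix the prior probability
of C, i.e.,

\[
a := p(C), \tag{1.12}
\]

and the conditional probabilities

\[
\begin{aligned}
p_1 &:= p(M \mid C), & q_1 &:= p(M \mid \neg C), \\
p_2 &:= p(O \mid M), & q_2 &:= p(O \mid \neg M).
\end{aligned}
\]

We can now calculate the prior probability distribution over the variables
C, M and O (where $\bar{x}$ abbreviates $1 - x$):

\[
\begin{aligned}
p(C, M, O) &= a\, p_1\, p_2, & p(C, M, \neg O) &= a\, p_1\, \bar{p}_2, \\
p(C, \neg M, O) &= a\, \bar{p}_1\, q_2, & p(C, \neg M, \neg O) &= a\, \bar{p}_1\, \bar{q}_2, \\
p(\neg C, M, O) &= \bar{a}\, q_1\, p_2, & p(\neg C, M, \neg O) &= \bar{a}\, q_1\, \bar{p}_2, \\
p(\neg C, \neg M, O) &= \bar{a}\, \bar{q}_1\, q_2, & p(\neg C, \neg M, \neg O) &= \bar{a}\, \bar{q}_1\, \bar{q}_2.
\end{aligned} \tag{1.13}
\]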


Next we learn two items of information. First, we learn that O obtains.
Assuming that the conditional independencies depicted in Figure 1.3 do not
change, this means that we learn that

\[
p'(O) = a' \left(p_1'\, p_2' + \bar{p}_1'\, q_2'\right) + \bar{a}' \left(q_1'\, p_2' + \bar{q}_1'\, q_2'\right) = 1, \tag{1.14}
\]

where we have replaced all variables by the corresponding primed variables.
Second, we learn the conditional “if the Copernican model is true,
then the outer planets will display retrograde motion when they are close
to Earth”, which implies that

\[
p'(M \mid C) = p_1' = 1. \tag{1.15}
\]

Inserting Equation (1.15) into Equation (1.14), we obtain

\[
a'\, p_2' + \bar{a}' \left(q_1'\, p_2' + \bar{q}_1'\, q_2'\right) = 1. \tag{1.16}
\]

This equation only holds for $a' \in (0,1)$ if

\[
p_2' = 1 \tag{1.17}
\]

and if

\[
q_1'\, p_2' + \bar{q}_1'\, q_2' = q_1' + \bar{q}_1'\, q_2' = 1. \tag{1.18}
\]

It has the solutions (i) $q_1' = 1$ and (ii) $q_2' = 1$. As solution (i) does not make
sense (if the Copernican model is false, the retrograde motion pattern stays
unexplained), we conclude that

\[
q_2' = 1. \tag{1.19}
\]


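Equations (1.17) and (1.19) make sure that p'(O) = 1. Inserting conditions
(1.15), (1.17) and (1.19) into the analogues of Equation (1.13), we can
calculate the posterior probability distribution:

\[
\begin{aligned}
p'(C, M, O) &= a', & p'(C, M, \neg O) &= 0, \\
p'(C, \neg M, O) &= 0, & p'(C, \neg M, \neg O) &= 0, \\
p'(\neg C, M, O) &= \bar{a}'\, q_1', & p'(\neg C, M, \neg O) &= 0, \\
p'(\neg C, \neg M, O) &= \bar{a}'\, \bar{q}_1', & p'(\neg C, \neg M, \neg O) &= 0.
\end{aligned} \tag{1.20}
\]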


We can now show the following theorem:

Theorem 1.4 Consider the Bayesian Network in Figure 1.3 with the prior
probability distribution from Equation (1.13). Let

\[
k_0 := \frac{p_1\, p_2}{q_1\, p_2 + \bar{q}_1\, q_2}. \tag{1.21}
\]

We furthermore assume that


(i) the posterior probability distribution p' is defined over the same Bayesian
Network;


(ii) the learned information constrains p' via Equations (1.14) and (1.15);

(iii) p' minimizes the Kullback-Leibler divergence to p.

Then p'(C) > p(C) if and only if $k_0 > 1$.
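
The direction of the update in Theorem 1.4 can be illustrated by brute force. The sketch below assumes the posterior already has the form of Equation (1.20), with free parameters p'(C) and p'(M|¬C), and minimizes the Kullback-Leibler divergence to the prior of Equation (1.13) on a grid. It is an illustration under these assumptions rather than a proof; the function names and the two parameter sets are made up.

```python
import math


def k0(p1, q1, p2, q2):
    """Equation (1.21): k0 = p1 p2 / (q1 p2 + (1 - q1) q2)."""
    return p1 * p2 / (q1 * p2 + (1 - q1) * q2)


def kl_chain(c_post, q1_post, c, p1, q1, p2, q2):
    """D_KL(p'||p) for a posterior of the form (1.20) and the prior (1.13)."""
    atoms = [
        (c_post, c * p1 * p2),                                     # C, M, O
        ((1 - c_post) * q1_post, (1 - c) * q1 * p2),               # not-C, M, O
        ((1 - c_post) * (1 - q1_post), (1 - c) * (1 - q1) * q2),   # not-C, not-M, O
    ]
    return sum(post * math.log(post / prior) for post, prior in atoms)


def posterior_c_on_grid(c, p1, q1, p2, q2, steps=200):
    """Grid search for the value of p'(C) that minimizes the divergence."""
    grid = [i / steps for i in range(1, steps)]
    _, best_c, _ = min(
        (kl_chain(x, y, c, p1, q1, p2, q2), x, y) for x in grid for y in grid
    )
    return best_c


prior_c = 0.2
raising = dict(p1=0.6, q1=0.1, p2=0.9, q2=0.05)    # made-up values with k0 > 1
lowering = dict(p1=0.1, q1=0.5, p2=0.9, q2=0.05)   # made-up values with k0 < 1

for label, params in (("raising", raising), ("lowering", lowering)):
    post_c = posterior_c_on_grid(prior_c, **params)
    print(f"{label}: k0 = {k0(**params):.3f}, p(C) = {prior_c}, p'(C) ~ {post_c}")
```

In both made-up scenarios the grid minimizer moves in the direction the theorem predicts: above the prior when k0 exceeds 1 and below it otherwise.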


That is, our rational degree of belief in the Copernican model is
increased when the condition $k_0 > 1$ holds. When will this be the case?
This depends on how we flesh out the details of the historical story and
the background assumptions of the astronomer. The prior degree of belief
in the Copernican model might have been small at the time, since
the Copernican model did not produce more accurate predictions than the
Ptolemaic model and did not provide explanations for many physical
phenomena, such as the movement of the Earth. Hence $a$ is small. Moreover,
the observed correlation between brightness and retrograde motion is not
explained by any alternative model, which speaks for a small probability
of

\[
p(O) = a \left(p_1\, p_2 + \bar{p}_1\, q_2\right) + \bar{a} \left(q_1\, p_2 + \bar{q}_1\, q_2\right). \tag{1.22}
\]

As $a$ is small and $\bar{a}$ is large, we conclude that $\varepsilon := q_1 p_2 + \bar{q}_1 q_2$ must be
small, too.

From the story it is also clear that $p_2 = p(O \mid M)$ is fairly large: Given
the postulated relation between a planet's position and the pattern of
retrograde motion, agreement between brightness and retrograde motion is
to be expected. At the same time, $q_2$ will be very small as there is no reason
to assume such a striking agreement if planets do not display a retrograde
motion pattern when close to Earth. Finally, $p_1$ may not be very large, but
the previous considerations suggest that $p_1 \gg \varepsilon$. We conclude that

\[
k_0 = \frac{p_2}{\varepsilon}\, p_1 \tag{1.23}
\]

will typically be greater than 1. If $k_0 \le 1$, then the probability of C will not
increase after learning the two pieces of information.

We conclude that the proposed method yields the intuitively correct 
result in this case. Of course, the exact result will depend on the specific
details of the story, but this appears to be a very sensible feature of our
approach: we have already seen before that contextual factors may determine
whether learning a conditional raises or lowers the probability of the
antecedent.


The Economics Example 


Finally, we turn to the economics example. To represent the scenario, we
introduce the following propositional variables. The variable R has the
values R: “An economic recovery is going on”, and ¬R: “No economic recovery
is going on”. The variable I has the values I: “Consumer income
is increased”, and ¬I: “Consumer income is not increased”. The variable
S has the values S: “The level of spending in warehouses is low”, and ¬S:
“The level of spending in warehouses is high”. The Bayesian Network
in Figure 1.4 represents the probabilistic dependencies and independencies
between these variables, as well as their causal relations. Note that the
Bayesian Network in Figure 1.4 has the same structure as the Bayesian Network
in Figure 1.3. Our calculation therefore proceeds as in the previous
example.


Figure 1.4: The Bayesian Network for the Economics Example (a chain R → I → S).


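To complete the Bayesian Network, we have to fix the prior probability
of R, i.e.,

\[
r := p(R), \tag{1.24}
\]

and the conditional probabilities

\[
\begin{aligned}
p_1 &:= p(I \mid R), & q_1 &:= p(I \mid \neg R), \\
p_2 &:= p(S \mid I), & q_2 &:= p(S \mid \neg I).
\end{aligned}
\]

We can now calculate the prior probability distribution over the variables
R, I and S:

\[
\begin{aligned}
p(R, I, S) &= r\, p_1\, p_2, & p(R, I, \neg S) &= r\, p_1\, \bar{p}_2, \\
p(R, \neg I, S) &= r\, \bar{p}_1\, q_2, & p(R, \neg I, \neg S) &= r\, \bar{p}_1\, \bar{q}_2, \\
p(\neg R, I, S) &= \bar{r}\, q_1\, p_2, & p(\neg R, I, \neg S) &= \bar{r}\, q_1\, \bar{p}_2, \\
p(\neg R, \neg I, S) &= \bar{r}\, \bar{q}_1\, q_2, & p(\neg R, \neg I, \neg S) &= \bar{r}\, \bar{q}_1\, \bar{q}_2.
\end{aligned} \tag{1.25}
\]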


Next we learn two items of information and our probability distribution
changes from p to p'. First, we learn that S obtains. Assuming that
the causal structure depicted in Figure 1.4 does not change, this means that
we learn that

\[
p'(S) = r' \left(p_1'\, p_2' + \bar{p}_1'\, q_2'\right) + \bar{r}' \left(q_1'\, p_2' + \bar{q}_1'\, q_2'\right) = 1, \tag{1.26}
\]

where we have replaced all variables by the corresponding primed variables.
Second, we learn the conditional “if there is an economic recovery
going on, consumer income is increased”, which implies that

\[
p'(I \mid R) = p_1' = 1. \tag{1.27}
\]

Inserting Equation (1.27) into Equation (1.26), we obtain:

\[
r'\, p_2' + \bar{r}' \left(q_1'\, p_2' + \bar{q}_1'\, q_2'\right) = 1. \tag{1.28}
\]

This equation only holds for $r' \in (0,1)$ if

\[
p_2' = 1 \tag{1.29}
\]

and if

\[
q_1'\, p_2' + \bar{q}_1'\, q_2' = q_1' + \bar{q}_1'\, q_2' = 1. \tag{1.30}
\]

It has the solutions (i) $q_1' = 1$ and (ii) $q_2' = 1$. As solution (i) does not make
sense (why should we be certain that consumer income is increased?),
we conclude that

\[
q_2' = 1. \tag{1.31}
\]


Inserting conditions (1.27), (1.29) and (1.31) into the analogues of
Equation (1.25), we can calculate the posterior probability distribution:

\[
\begin{aligned}
p'(R, I, S) &= r', & p'(R, I, \neg S) &= 0, \\
p'(R, \neg I, S) &= 0, & p'(R, \neg I, \neg S) &= 0, \\
p'(\neg R, I, S) &= \bar{r}'\, q_1', & p'(\neg R, I, \neg S) &= 0, \\
p'(\neg R, \neg I, S) &= \bar{r}'\, \bar{q}_1', & p'(\neg R, \neg I, \neg S) &= 0.
\end{aligned} \tag{1.32}
\]


The structure of the relevant Bayesian Network in the economics example
is the same as in the previous astronomy example (see Figures 1.3 and
1.4). Hence, we can apply Theorem 1.4. Whether or not the probability of
the antecedent R is raised by learning the conditional depends on whether
or not

\[
k_0 = \frac{p_1\, p_2}{q_1\, p_2 + \bar{q}_1\, q_2} > 1.
\]

In this case, we have evidence to the contrary. It is clear from the story
that $q_2 \gg p_2$: the probability of low spending is higher for unchanged than
for increased consumer income. Hence,

\[
k_0 < \frac{p_1\, p_2}{q_1\, p_2 + \bar{q}_1\, p_2} = \frac{p_1\, p_2}{p_2} = p_1 \le 1. \tag{1.33}
\]

We conclude that the posterior probability of economic recovery is
smaller than the prior probability. Hence, the proposed method again
yields the intuitively correct result.
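
A small numerical illustration of this case follows. It assumes the odds form of the update used in Lemma 3 and in the proof of Theorem 1.4, namely r'/(1 - r') = k0 · r/(1 - r), and solves it for r'. The function names and the parameter values (chosen so that q2 is much larger than p2) are illustrative assumptions.

```python
def k0(p1, q1, p2, q2):
    """k0 = p1 p2 / (q1 p2 + (1 - q1) q2), as in Equation (1.21)."""
    return p1 * p2 / (q1 * p2 + (1 - q1) * q2)


def posterior_from_odds(prior, k):
    """Solve x'/(1 - x') = k * x/(1 - x) for x'."""
    return k * prior / (k * prior + (1 - prior))


# Made-up values with q2 much larger than p2: low spending is far more
# likely when consumer income has not increased.
r, p1, q1, p2, q2 = 0.5, 0.8, 0.3, 0.1, 0.7

k = k0(p1, q1, p2, q2)
r_post = posterior_from_odds(r, k)
print(f"k0 = {k:.4f}, p(R) = {r}, p'(R) = {r_post:.4f}")
```

Here k0 stays below p1, as in Equation (1.33), and the posterior probability of recovery falls below the prior.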


Discussion 


Minimizing the Kullback-Leibler divergence between prior and posterior 
probability distribution, subject to a set of empirical constraints, is an in- 
teresting extension of Bayesian Conditionalization. First, it has a wider 
scope and allows for normatively attractive and computationally feasible 
learning of a wide range of constraints on the posterior probability dis- 
tribution, such as the mean or variance of a random variable. The ability 
to process such evidence is an important feature of any theory of scien- 
tific inference. Second, whenever we perform Bayesian Conditionaliza- 
tion on a first-order proposition, minimizing Kullback-Leibler divergence 
will deliver the same result. Learning by minimizing Kullback-Leibler di- 
vergence is thus a conservative extension of Bayesian Conditionalization 
which does not threaten core Bayesian principles. Rather, it enlarges the 
scope of Bayesian reasoning in science. Using the divergence minimiza- 
tion method, we can address and resolve challenges that have been put 
forward against Bayesian Conditionalization and Bayesian reasoning in 
general. 

This variation has applied divergence minimization to learning conditional
information. We have focused on evidence in the form of indicative
conditionals, e.g., that a certain manipulation reliably yields a certain
result, or that a scientific theory has certain observational consequences.
Learning these conditionals is hard to represent in the ordinary Bayesian
mechanism, as shown by Popper and Miller's paradoxical results for learning
material conditionals. Without delving further into the epistemology
of conditionals, we assume that learning a conditional H → E imposes
a constraint on our posterior distribution p': namely that the conditional
probability of E, given H, is equal to one. If one would like to attack our
work in this variation, one could either doubt that learning a conditional
imposes the constraint p'(E|H) = 1 on the posterior distribution, or require
that learning the conditional implies more constraints. To us, neither of
these options looks particularly appealing.


The divergence minimization method can now be applied to minimizing
the divergence between p and p', subject to the constraint p'(E|H) = 1.
However, direct application of that method does not take account of the
variety of the inferences that we make when learning a conditional: sometimes
the probability of the antecedent is raised, sometimes it is lowered,
sometimes it stays equal.


To deal with all three cases, we have suggested a refinement of the
divergence minimization method:
represent the causal and inferential relations among the involved propo- 
sitions by a Bayesian network with a set of conditional and unconditional 
independencies. These independencies act as constraints on both the prior 
and posterior distributions. After all, in the discussed examples, they con- 
cern elements of the background story and are not changed by learning 
the conditional. When Kullback-Leibler divergence is minimized subject 
to these constraints, the intuitively correct results follow. 


Does the proposed method also give adequate results if more complicated
scenarios are considered? We do not see how to answer
this question in full generality. The set of possible scenarios where conditionals
are learned is unrestricted, and one cannot do much apart from
studying them case by case. We are, however, optimistic that the proposed
method will work for more complicated scenarios involving more
than three variables, as our examples represent diverse cases of probabilistic
dependencies. The logic behind our approach is simple and intuitive:
in moving from a prior to a posterior distribution, one should not only
minimize the divergence subject to novel evidence, but also subject to those
constraints which do not change under learning the conditional information.
This concerns in particular the causal and inferential structure of the
model, e.g., the variables which are affected by an intervention and the
set of probabilistic independencies. Whenever the learned evidence does
not change these relations, our model provides a general and adequate
method of Bayesian updating. We conclude that the scope of evidence that
Bayesian reasoners can model is wider than that captured by Bayesian
Conditionalization (i.e., first-order propositions). This observation rebuts
a large number of criticisms raised against the Bayesian research program
in philosophy of science.


Proofs of the Theorems


Auxiliary Lemmata 


The following three lemmata will be useful for the proofs presented in the 
remainder of this Appendix. 

Lemma 1: Let $f(x) := \log(ax)$, $g(x) := x \log(ax)$ and $h(x) := \bar{x} \log(a\bar{x})$.
Then the first derivatives are: $f'(x) = 1/x$, $g'(x) = 1 + \log(ax)$
and $h'(x) = -1 - \log(a\bar{x})$.


Proof: Trivial.


Lemma 2: The function $f(x) := x \log\frac{x}{x'} + \bar{x} \log\frac{\bar{x}}{\bar{x}'}$ has a minimum at $x = x'$.


Proof: Using Lemma 1, we obtain

\[
f'(x) = \log\left(\frac{x}{\bar{x}} \cdot \frac{\bar{x}'}{x'}\right). \tag{1.34}
\]

Setting this expression equal to zero (i.e., the argument of the logarithm
equal to 1), one obtains $x = x'$. As $f''(x) = 1/(x\bar{x}) > 0$ for all $x \in (0,1)$,
we have indeed found the minimum.


Lemma 3: Consider the equation $x'/\bar{x}' = k \cdot x/\bar{x}$ with $k > 0$. Then (i)
$x' > x$ iff $k > 1$, (ii) $x' = x$ iff $k = 1$ and (iii) $x' < x$ iff $k < 1$.


Proof: This follows from the observation that the function $g(x) := x/\bar{x}$
is strictly monotonically increasing for $x \in (0,1)$.


We now proceed to proving our main results. 


Proof of the Theorems 


Proof of Theorem 1.1: Let H be any closed sentence of L. The joint posterior
probability distribution over H and E will have the following form:

\[
\begin{aligned}
p'(H, E) &= h'\, p', & p'(H, \neg E) &= h'\, \bar{p}', \\
p'(\neg H, E) &= \bar{h}'\, q', & p'(\neg H, \neg E) &= \bar{h}'\, \bar{q}',
\end{aligned} \tag{1.35}
\]

where we have replaced all variables by the corresponding primed variables.
The constraint p'(E) = 1 and Equation (1.35) then entail that

\[
h'\, p' + \bar{h}'\, q' = 1 \tag{1.36}
\]

and, taking into account that all four atoms in Equation (1.35) sum up to
1, that

\[
h'\, \bar{p}' + \bar{h}'\, \bar{q}' = 0. \tag{1.37}
\]

It is easy to see that Equation (1.37) only holds for all $h' \in (0,1)$ if $p' = q' = 1$.
In this case, Equation (1.36) is automatically fulfilled for all h'. The
posterior probability distribution then simplifies as follows:

\[
\begin{aligned}
p'(H, E) &= h', & p'(H, \neg E) &= 0, \\
p'(\neg H, E) &= \bar{h}', & p'(\neg H, \neg E) &= 0.
\end{aligned} \tag{1.38}
\]


To determine the value of h', we differentiate the Kullback-Leibler divergence
(see Equation (1.5)) between p' and p with respect to h' and obtain
after some algebra:

\[
\frac{\partial D_{KL}}{\partial h'} = \log\left(\frac{h'}{\bar{h}'} \cdot \frac{\bar{h}\, q}{h\, p}\right). \tag{1.39}
\]

To find the minimum, we set the latter expression equal to zero (i.e., we
set the argument of the logarithm equal to 1) and obtain:

\[
h' = \frac{h\, p}{h\, p + \bar{h}\, q}.
\]

In more familiar form, this can be written as

\[
p'(H) = p(H \mid E),
\]

where the right hand side is the posterior probability that
follows from Bayesian Conditionalization. The posterior distribution obtained
from minimizing Kullback-Leibler divergence subject to the constraint
p'(E) = 1 is equal to the distribution obtained by conditionalizing
on E. To complete the proof, we convince ourselves that

\[
\frac{\partial^2 D_{KL}}{\partial h'^2} = \frac{1}{h'\, \bar{h}'} > 0 \tag{1.40}
\]

for all $h' \in (0,1)$, which shows that we have indeed found the unique
minimum of D_KL(p'||p). Hence, Bayesian Conditionalization follows from
minimizing the Kullback-Leibler divergence between posterior and prior
probability distribution, if one considers the learned information as a
constraint on the posterior.
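
The result can be cross-checked numerically. The sketch below assumes, as in the proof, that the constraint p'(E) = 1 leaves positive posterior mass only on the atoms (H, E) and (¬H, E), and that h = p(H), p = p(E|H) and q = p(E|¬H). It minimizes the divergence with a ternary search and compares the minimizer with Bayes' theorem; the function names and the numbers are illustrative assumptions.

```python
import math


def kl_after_learning_e(h_post, h, p, q):
    """D_KL(p'||p) when p'(E) = 1 and p'(H) = h_post."""
    return (h_post * math.log(h_post / (h * p))
            + (1 - h_post) * math.log((1 - h_post) / ((1 - h) * q)))


def argmin_convex(f, lo=1e-9, hi=1 - 1e-9, iterations=200):
    """Ternary search for the minimizer of a strictly convex function on (lo, hi)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2   # the minimizer lies to the left of m2
        else:
            lo = m1   # the minimizer lies to the right of m1
    return (lo + hi) / 2


# Made-up prior: h = p(H), p = p(E|H), q = p(E|not-H).
h, p, q = 0.2, 0.9, 0.3

h_kl = argmin_convex(lambda x: kl_after_learning_e(x, h, p, q))
h_bayes = h * p / (h * p + (1 - h) * q)   # Bayesian Conditionalization on E
print(f"KL minimizer: {h_kl:.6f}, p(H|E): {h_bayes:.6f}")
```

Up to the numerical tolerance of the search, both routes give the same posterior probability of H.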

Proof of Theorem 1.2: The proof runs analogously to that of Theorem 1.1.
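To find the minimum of $D_{KL}(p' \| p)$, we first differentiate this expression
with respect to q' and obtain

\[
\frac{\partial D_{KL}}{\partial q'} = \bar{h}' \log\left(\frac{q'}{\bar{q}'} \cdot \frac{\bar{q}}{q}\right).
\]

Next, we set this expression equal to zero and obtain q' = q. With this, we
simplify $D_{KL}$ and obtain

\[
D_{KL}(p' \| p) = \left(h' \log\frac{h'}{h} + \bar{h}' \log\frac{\bar{h}'}{\bar{h}}\right) + h' \log\frac{1}{p}.
\]

Next, we differentiate $D_{KL}(p' \| p)$ with respect to h' and obtain

\[
\frac{\partial D_{KL}}{\partial h'} = \log\left(\frac{h'}{\bar{h}'} \cdot \frac{\bar{h}}{h} \cdot \frac{1}{p}\right).
\]

Setting this expression equal to zero yields

\[
\frac{h'}{\bar{h}'} = p \cdot \frac{h}{\bar{h}}
\]

and hence

\[
h' = \frac{h\, p}{h\, p + \bar{h}}.
\]

Using Lemma 3 (with k = p), we conclude from the equation for $h'/\bar{h}'$ that h' < h if
0 < p < 1.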
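
The last step can also be verified directly, without Lemma 3. The following short calculation assumes the stationary point has the form $h' = hp/(hp + \bar{h})$ with $\bar{h} = 1 - h$ and $\bar{p} = 1 - p$, as the derivation above indicates:

```latex
\[
h' - h
  = \frac{h\,p - h\,(h\,p + \bar{h})}{h\,p + \bar{h}}
  = \frac{h\,\bigl(p\,(1 - h) - \bar{h}\bigr)}{h\,p + \bar{h}}
  = -\,\frac{h\,\bar{h}\,\bar{p}}{h\,p + \bar{h}} .
\]
```

For $h \in (0,1)$ and $p \in (0,1)$ the right-hand side is strictly negative, so $h' < h$; for $p = 1$ it vanishes and $h' = h$.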


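Proof of Theorem 1.3: With the constraint that $\beta = p(A \mid E, \neg S) = 1$, the
prior probability distribution over A, E and S takes the following form:

\[
\begin{aligned}
p(A, E, S) &= \alpha\, e\, s, & p(\neg A, E, S) &= \bar{\alpha}\, e\, s, \\
p(A, E, \neg S) &= e\, \bar{s}, & p(\neg A, E, \neg S) &= 0, \\
p(A, \neg E, S) &= \gamma\, \bar{e}\, s, & p(\neg A, \neg E, S) &= \bar{\gamma}\, \bar{e}\, s, \\
p(A, \neg E, \neg S) &= \delta\, \bar{e}\, \bar{s}, & p(\neg A, \neg E, \neg S) &= \bar{\delta}\, \bar{e}\, \bar{s}.
\end{aligned} \tag{1.41}
\]

Taking into account the learned conditional p'(A|E) = 1 and the condition
p'(S) = 0 that we derived from that equation, the posterior distribution
looks as follows:

\[
\begin{aligned}
p'(A, E, S) &= 0, & p'(\neg A, E, S) &= 0, \\
p'(A, E, \neg S) &= e', & p'(\neg A, E, \neg S) &= 0, \\
p'(A, \neg E, S) &= 0, & p'(\neg A, \neg E, S) &= 0, \\
p'(A, \neg E, \neg S) &= \delta'\, \bar{e}', & p'(\neg A, \neg E, \neg S) &= \bar{\delta}'\, \bar{e}'.
\end{aligned} \tag{1.42}
\]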


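Calculating the Kullback-Leibler divergence between p' and p, we obtain

\[
\begin{aligned}
D_{KL}(p' \| p) &:= \sum_{A,E,S} p'(A,E,S) \log\frac{p'(A,E,S)}{p(A,E,S)} \\
&= e' \log\frac{e'}{e\,\bar{s}} + \delta'\,\bar{e}' \log\frac{\delta'\,\bar{e}'}{\delta\,\bar{e}\,\bar{s}} + \bar{\delta}'\,\bar{e}' \log\frac{\bar{\delta}'\,\bar{e}'}{\bar{\delta}\,\bar{e}\,\bar{s}} \\
&= \log\frac{1}{\bar{s}} + e' \log\frac{e'}{e} + \bar{e}' \log\frac{\bar{e}'}{\bar{e}} + \bar{e}' \left(\delta' \log\frac{\delta'}{\delta} + \bar{\delta}' \log\frac{\bar{\delta}'}{\bar{\delta}}\right).
\end{aligned}
\]

Next, we differentiate this expression with respect to e' and δ' and obtain

\[
\begin{aligned}
\frac{\partial D_{KL}}{\partial e'} &= \log\left(\frac{e'}{\bar{e}'} \cdot \frac{\bar{e}}{e}\right) - \left(\delta' \log\frac{\delta'}{\delta} + \bar{\delta}' \log\frac{\bar{\delta}'}{\bar{\delta}}\right), \\
\frac{\partial D_{KL}}{\partial \delta'} &= \bar{e}' \log\left(\frac{\delta'}{\bar{\delta}'} \cdot \frac{\bar{\delta}}{\delta}\right).
\end{aligned}
\]

Setting the derivative with respect to δ' equal to zero, we obtain

\[
\delta' = \delta. \tag{1.43}
\]

Substituting this result into the derivative with respect to e', we obtain

\[
\frac{\partial D_{KL}}{\partial e'} = \log\left(\frac{e'}{\bar{e}'} \cdot \frac{\bar{e}}{e}\right). \tag{1.44}
\]

Setting the expression in Equation (1.44) equal to zero, we finally obtain
e' = e. To show that we have indeed found a minimum, we calculate the
Hessian matrix of $D_{KL}$ at (e', δ') = (e, δ) and obtain

\[
H(D_{KL})\big|_{(e,\delta)} = \begin{pmatrix} \dfrac{1}{e\,\bar{e}} & 0 \\ 0 & \dfrac{\bar{e}}{\delta\,\bar{\delta}} \end{pmatrix}. \tag{1.45}
\]

This matrix is positive definite, which completes the proof of Theorem
1.3.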


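Proof of Theorem 1.4: With the prior probability distribution from
Equation (1.13) and the posterior probability distribution from Equation
(1.20), we obtain for the Kullback-Leibler divergence between P' and P
(writing c := p(C) for the prior probability of C, denoted a in Equation (1.12),
and P, P' for p, p'):

\[
\begin{aligned}
D_{KL}(P' \| P) &:= \sum_{C,M,O} P'(C,M,O) \log\frac{P'(C,M,O)}{P(C,M,O)} \\
&= c' \log\frac{c'}{c\,p_1\,p_2} + \bar{c}'\,q_1' \log\frac{\bar{c}'\,q_1'}{\bar{c}\,q_1\,p_2} + \bar{c}'\,\bar{q}_1' \log\frac{\bar{c}'\,\bar{q}_1'}{\bar{c}\,\bar{q}_1\,q_2} \\
&= c' \log\frac{c'}{c} + \bar{c}' \log\frac{\bar{c}'}{\bar{c}} + \bar{c}' \left(q_1' \log\frac{q_1'}{q_1\,p_2} + \bar{q}_1' \log\frac{\bar{q}_1'}{\bar{q}_1\,q_2}\right) + c' \log\frac{1}{p_1\,p_2}.
\end{aligned}
\]

Next, we calculate the first derivatives of $D_{KL}(P' \| P)$ with respect to c' and
$q_1'$ and obtain after some algebra:

\[
\begin{aligned}
\frac{\partial D_{KL}}{\partial c'} &= \log\left(\frac{1}{p_1\,p_2} \cdot \frac{c'}{\bar{c}'} \cdot \frac{\bar{c}}{c}\right) - \left(q_1' \log\frac{q_1'}{q_1\,p_2} + \bar{q}_1' \log\frac{\bar{q}_1'}{\bar{q}_1\,q_2}\right), \\
\frac{\partial D_{KL}}{\partial q_1'} &= \bar{c}' \log\left(\frac{q_1'}{\bar{q}_1'} \cdot \frac{\bar{q}_1\,q_2}{q_1\,p_2}\right).
\end{aligned} \tag{1.46}
\]

To minimize $D_{KL}(P' \| P)$, we first set the second expression in (1.46) equal to zero
(noting that $\bar{c}' \in (0,1)$) and obtain

\[
q_1' = \frac{q_1\,p_2}{q_1\,p_2 + \bar{q}_1\,q_2}. \tag{1.47}
\]

With this, we simplify the first expression in Equation (1.46) and obtain

\[
\frac{\partial D_{KL}}{\partial c'} = \log\left(\frac{1}{k_0} \cdot \frac{c'}{\bar{c}'} \cdot \frac{\bar{c}}{c}\right), \tag{1.48}
\]

with $k_0 = p_1 p_2 / (q_1 p_2 + \bar{q}_1 q_2)$ as defined in Equation (1.21).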


Setting now also the expression in Equation (1.48) to zero, we obtain

\[
\frac{c'}{\bar{c}'} = k_0 \cdot \frac{c}{\bar{c}}. \tag{1.49}
\]

Using Lemma 3, we conclude that c' > c iff $k_0 > 1$. This completes the
proof of Theorem 1.4. (We skip the proof that the corresponding Hessian
is positive definite if Equations (1.47) and (1.49) hold.)

Finally, it is interesting to see that conditionalizing on O and the material
conditional C ⊃ M ≡ ¬C ∨ M yields the same result in this case:

\[
\begin{aligned}
p(C \mid C \supset M, O) &= p(C \mid O \wedge (\neg C \vee M)) = \frac{p(C \wedge O \wedge (\neg C \vee M))}{p(O \wedge (\neg C \vee M))} \\
&= \frac{p(O \wedge C \wedge M)}{p((O \wedge \neg C) \vee (O \wedge M))} = \frac{p(O, C, M)}{p(O, \neg C) + p(O, M) - p(O, \neg C, M)} \\
&= \frac{p(O, C, M)}{p(O, \neg C) + p(O, C, M)}.
\end{aligned}
\]

With the Bayesian Network depicted in Figure 1.3 and the prior probability
distribution from Equation (1.13), we then obtain, where c = p(C),

\[
p^*(C) := p(C \mid C \supset M, O) = \frac{c\,p_1\,p_2}{c\,p_1\,p_2 + \bar{c}\,(q_1\,p_2 + \bar{q}_1\,q_2)} = \frac{c\,k_0}{c\,k_0 + \bar{c}}. \tag{1.50}
\]

From this equation it is easy to see that $p^*(C) > p(C)$ iff $k_0 > 1$. Hence,
both procedures yield exactly the same result in this case.
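
The agreement between the two procedures can be checked by enumeration. The sketch below builds the joint prior of the chain C → M → O from the five parameters, conditionalizes on O together with the material conditional ¬C ∨ M, and compares the result with the closed form c·k0/(c·k0 + 1 − c). The function names and the parameter values are illustrative assumptions.

```python
from itertools import product


def joint_chain(c, p1, q1, p2, q2):
    """Joint prior over (C, M, O) for the chain C -> M -> O, as a dict."""
    dist = {}
    for c_val, m_val, o_val in product([True, False], repeat=3):
        pc = c if c_val else 1 - c
        pm_true = p1 if c_val else q1        # p(M | C) or p(M | not-C)
        pm = pm_true if m_val else 1 - pm_true
        po_true = p2 if m_val else q2        # p(O | M) or p(O | not-M)
        po = po_true if o_val else 1 - po_true
        dist[(c_val, m_val, o_val)] = pc * pm * po
    return dist


def probability(dist, event):
    """Total prior mass of the worlds in which `event` holds."""
    return sum(prob for world, prob in dist.items() if event(*world))


def evidence(c_val, m_val, o_val):
    """O together with the material conditional (not-C or M)."""
    return o_val and ((not c_val) or m_val)


# Made-up parameters for illustration.
c, p1, q1, p2, q2 = 0.2, 0.6, 0.1, 0.9, 0.05
prior = joint_chain(c, p1, q1, p2, q2)

p_star = (probability(prior, lambda C, M, O: C and evidence(C, M, O))
          / probability(prior, evidence))
k0 = p1 * p2 / (q1 * p2 + (1 - q1) * q2)
closed_form = c * k0 / (c * k0 + (1 - c))
print(f"conditionalization: {p_star:.6f}, closed form: {closed_form:.6f}")
```

Both numbers coincide, and they exceed the prior c here because the made-up parameters give k0 > 1.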


Variation 2: Confirmation 


Confirmation of scientific theories by empirical evidence is a central ele- 
ment of scientific reasoning. Their acceptance and rejection is often based 
on the track record of experiments that confirmed or undermined them. 
Eddington’s observations of the 1919 solar eclipse confirmed Einstein’s 
General Theory of Relativity (GTR) and strongly contributed to the en- 
dorsement of GTR among theoretical physicists. Equally spectacularly, a 
huge set of observations by CERN researchers confirmed the existence of 
the Higgs Boson, a fundamental particle hypothesized in the 1960s. In eco- 
nomics, Maurice Allais and Daniel Ellsberg conducted experiments about 
decision-making under uncertainty that undermined the empirical basis 
of Rational Choice Theory. But what are the conditions when a piece of 
evidence confirms or undermines a theory? 

Philosophical accounts of confirmation answer this question by charac- 
terizing a confirmatory relationship between theory and evidence in log- 
ical or probabilistic terms. Such criteria facilitate the analysis and recon- 
struction of canonical confirmation cases in the history of science, and they 
also allow for a critical evaluation of experiments and observational studies
in modern science. As we will see in Variations 9 and 11, theories of
confirmation also connect to hypothesis testing in science. 

The concept of confirmation also has numerous relations to other cen- 
tral topics of scientific reasoning. For example, Variations 6 and 7 expose 
substantial links between degree of confirmation, causal effect and ex- 
planatory power, following up on Carl G. Hempel’s (1965) postulate of a 
structural identity between explanation and prediction. In Variation 8, we 
show how establishing intertheoretic relations between different theories, 
e.g., Nagelian reduction, may confirm a theory and raise our confidence 
in it. 

Moreover, scientific reasoning can often be cast in confirmatory arguments.
In Variation 4, we show how the failure to find satisfactory alternatives
may confirm a theory, even if there is no positive empirical evidence
in its favor. Furthermore, Bayesian Confirmation Theory allows for a critical
analysis of the famous No Miracles Argument (NMA). That argument
cal analysis of the famous No Miracles Argument (NMA). That argument 
claims that the astonishing success of science in recent centuries indeed 
confirms the hypothesis that our best scientific theories genuinely refer 
and constitute knowledge of the world. More on this is said in Variation 
De 

The numerous references to later parts of this book make clear that 
confirmation is a basic concept in our work. Moreover, Bayesian Con- 
firmation Theory is the oldest and most worked-out branch of Bayesian 
philosophy of science. It provides an excellent case for motivating why 
the Bayesian calculus can elucidate scientific reasoning, in particular, how 
probabilistic accounts of confirmation can address longstanding puzzles 
about inductive inference (e.g., the tacking paradoxes, the grue paradox, 
and the paradox of the ravens). Therefore, we deal with this topic in quite 
some detail. We first explain the benefits of expressing confirmation in 
Bayesian terms (Section 2.1). Then we introduce different notions of con- 
firmation (firmness vs. increase in firmness—Section 2.2), and we examine 
the question whether there is a single best confirmation measure (Section 
2.3). In the end, we conclude that purely theoretical and conceptual rea- 
sons are not sufficient to determine a unique measure: different measures 
capture different senses of confirmation, and the choice between them may 
also be influenced by empirical and contextual factors (Section 2.4). 


Motivating Bayesian Confirmation Theory 


Probability is an extremely natural model for explicating degree of confir- 
mation. This has a number of reasons. 

First, probability is, as quipped by Cicero, “the guide to life”. Our 
decisions and actions are often based on which hypotheses are more prob- 
able than others: e.g., if there is a high chance of rain, we might cancel 
a planned beach trip. Confirmation is a guide to probability: better con- 
firmed hypotheses are, ceteris paribus, more probable than others. It is 
therefore natural to integrate confirmation and probability within a single 
mathematical formalism. 

Second, probability is the preferred tool for expressing uncertainty in 
science. Probability distributions are used for describing measurement error
and for characterizing the “noise” in a system, i.e., the part of the data
which cannot be explained by reference to natural laws. By phrasing 
confirmation in terms of probability, we connect a philosophical analy- 
sis of inductive inference to familiar scientific models where probabilistic 
expression of uncertainty already plays a dominant role (e.g., in linear 
regression models). 

Third, statistics, the science of analyzing and interpreting data and 
assessing theories on the basis of data, is formulated in terms of proba- 
bility theory. Statisticians have proved powerful mathematical results on 
the foundations of inductive inference, such as de Finetti’s famous repre- 
sentation theorem for subjective probability (de Finetti, 1974) or the con- 
vergence results by Blackwell and Dubins (1962) and Gaifman and Snir 
(1982). Probabilistic accounts of confirmation can directly make use of 
these results, leading to a beneficial interaction between philosophical and 
statistical work (e.g., Howson and Urbach, 2006; Good, 2009). For exam- 
ple, the widespread practice of null hypothesis significance tests (NHST) 
can be fruitfully reviewed from the standpoint of a probabilistic theory of 
confirmation (Royall, 1997; Sprenger, 2016b). 

These considerations explain why philosophy of science has paid so 
much attention to probabilistic confirmation theories. Among those theo- 
ries, Bayesian Confirmation Theory is the most prominent representative. 
We shall now describe this approach. 


Confirmation as (Increase in) Firmness 


We remember from the introduction that Bayesians represent subjective 
degrees of belief by means of a probability function. The basic idea of 
Bayesian Confirmation Theory is that confirmation judgments are func- 
tions of an agent’s conditional and unconditional degrees of belief. At 
first sight, this may appear unpalatably subjective. Two things should 
be noted, though: First, agents are assumed to be rational: their degrees 
of belief conform to the axioms of probability, take into account relevant 
evidence, etc. Second, even when the posterior degree of belief in a hy- 
pothesis differs among rational agents, they could still agree on the degree 
of confirmation that the evidence delivers. 

We now engage in a Bayesian explication of degree of confirmation 
and assume that it only depends on the joint probability distribution of
the hypothesis H and the evidence E. More precisely, we assume that E
and H are in the set of propositions ℒ of a propositional language L that
describes our domain of interest. A Bayesian confirmation measure is a
function c : ℒ² × 𝔓 → ℝ, where 𝔓 is the set of probability measures on
the algebra generated by ℒ that models the agent's degrees of belief. This
function assigns a real number c(H,E) to any pair of propositions (H, E)
together with a probability function p ∈ 𝔓; this number is interpreted
as the degree to which E confirms H. Reference to the probability measure
p is omitted as a matter of convenience and in agreement with conven- 
tions in the literature. Similarly, we omit reference to specific background 
assumptions K in the confirmation relation and just assume that they are 
shared among all rational agents. 

The classical method for explicating degree of confirmation is to specify
adequacy conditions for the concept and to derive representation theorems
for various confirmation measures. Such theorems characterize the
set of measures (or possibly the unique measure) that satisfy these con- 
straints. This approach allows for a sharp demarcation and mathemati- 
cally rigorous characterization of the explicandum, and at the same time 
for critical discussion of the explicatum, by means of defending and criti- 
cizing the properties which are encapsulated in the adequacy conditions. 

For example, a central function of a measure of degree of confirmation
is to establish a bridge between qualitative and quantitative theories of
confirmation:


Qualitative-Quantitative Bridge Principle for Confirmation  For any
      propositions $H, E \in \mathcal{L}$, probability measure
      $p \in \mathfrak{P}$ and confirmation measure
      $c : \mathcal{L}^2 \times \mathfrak{P} \to \mathbb{R}$, there is a
      real number $t \in \mathbb{R}$ such that

      • $E$ confirms/supports $H$ if and only if $c(H,E) > t$;
      • $E$ undermines/disconfirms $H$ if and only if $c(H,E) < t$;
      • $E$ is confirmationally neutral/irrelevant to $H$ if and only if
        $c(H,E) = t$.



In other words, a measure of degree of confirmation should guide our 
qualitative confirmation in the sense that there is a numerical threshold 
for telling positive confirmation from disconfirmation (Carnap, 1950, 463). 
As a matter of convenience, we often drop quantification over the
propositions $H, E \in \mathcal{L}$ and the probability measure
$p \in \mathfrak{P}$, following Crupi (2013).

As already explained, Bayesian Confirmation Theory phrases degree of 
confirmation in terms of probabilistic dependencies between hypothesis H 
and evidence E. The following adequacy condition makes this approach 
explicit and contributes to a more precise description of the confirmation 
measure (e.g., Crupi, 2013): 


Formality  $c(H,E)$ is a measurable function from the joint probability
      distribution over $H$ and $E$ to a real number $c(H,E) \in \mathbb{R}$.
      In particular, there exists a function $f : [0,1]^3 \to \mathbb{R}$
      such that $c(H,E) = f(p(H \wedge E), p(H), p(E))$.



Since the three probabilities $p(H \wedge E)$, $p(H)$, $p(E)$ suffice to
determine the joint probability distribution of $H$ and $E$, we can express
$c(H,E)$ as a function of these three arguments. In other words, Formality
creates the common ground on which the various confirmation measures compete.
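
To make the claim about the three probabilities concrete, here is a minimal Python sketch. It is an illustration only: the function names `joint_from_three` and `difference` and the example numbers are assumptions, not part of the text. The sketch recovers the four cells of the joint distribution over H and E and evaluates one possible choice of f, the difference between posterior and prior.

```python
from fractions import Fraction


def joint_from_three(p_he, p_h, p_e):
    """Recover the joint distribution over H and E from p(H and E), p(H), p(E)."""
    cells = {
        ("H", "E"): p_he,
        ("H", "not E"): p_h - p_he,
        ("not H", "E"): p_e - p_he,
        ("not H", "not E"): 1 - p_h - p_e + p_he,
    }
    # Reject inputs that no probability distribution could produce.
    if any(value < 0 for value in cells.values()):
        raise ValueError("inputs do not describe a probability distribution")
    return cells


def difference(p_he, p_h, p_e):
    """One illustrative choice of f: p(H|E) - p(H). Requires p(E) > 0."""
    return p_he / p_e - p_h


# Hypothetical example values: p(H and E) = 3/10, p(H) = 2/5, p(E) = 1/2.
example = (Fraction(3, 10), Fraction(2, 5), Fraction(1, 2))
cells = joint_from_three(*example)
print(cells)
print(difference(*example))
```

Any other measure that satisfies Formality could be written with the same three-argument signature, which is what makes the measures directly comparable.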

Another cornerstone for Bayesian explications of confirmation is the 
following principle: 


Final Probability Incrementality  For any propositions
      $H, E, E' \in \mathcal{L}$ with probability measure $p \in \mathfrak{P}$,

$$ c(H,E) > c(H,E') \quad \text{if and only if} \quad p(H \mid E) > p(H \mid E'). \tag{2.1} $$


According to this principle, E confirms H more than E’ does if and only if 
it raises the probability of H to a higher level. It is easy to show that Final 
Probability Incrementality also implies that c(H,E) = c(H,E’) if and only 
if p(H|E) = p(H|E’). Given the basic intuition that degree of confirmation 
should co-vary with boost in degree of belief, satisfactory Bayesian expli- 
cations of degree of confirmation should arguably satisfy this condition. 

There are now two main roads for adding more conditions, which will
ultimately lead us to two different explications of confirmation:
confirmation as firmness of belief and confirmation as increase in firmness
(Carnap, 1950). The latter is often called the incremental concept of
confirmation or confirmation as evidential support.

We begin with confirmation as firmness. Consider the football standings
from Table 2.1. Three teams in the Italian Serie A, AS Roma, FC
Internazionale (“Inter”), and Juventus (“Juve”), are competing for the
scudetto, the national soccer championship.


          After 36 out of 38 rounds     After 37 out of 38 rounds
  Rank    Team      Points              Team      Points
  1       Roma      78                  Inter     79
  2       Inter     76                  Roma      78
  3       Juve      74                  Juve      74

Table 2.1: A motivating example for Local Equivalence. Top of the
Serie A after 36 and 37 out of 38 rounds, respectively.


On the penultimate match day, Inter beats Juve in the Derby d'Italia while
Roma loses to another team. Call this conjunction of propositions E. Let
H = Inter will win the championship and H’ = Roma will be the runner-up.
Given E, H and H’ are logically equivalent in the sense that we can derive
one from the other given E. It is now very natural to claim that E confirms
H and H’ to an equal degree. This intuition is expressed in the following
adequacy condition:


Local Equivalence  If $H$ and $H'$ are logically equivalent given $E$ (i.e.,
      $E \wedge H \models H'$ and $E \wedge H' \models H$), then
      $c(H,E) = c(H',E)$.


In other words, E confirms the hypotheses H and H’ to an equal degree if 
they are indistinguishable conditional on E. 

If we buy into this intuition, Local Equivalence allows for a power- 
ful representation theorem by Michael Schippers (2016): all confirmation 
measures that satisfy Formality, Final Probability Incrementality, and Lo- 
cal Equivalence are non-decreasing functions of the posterior probability 
p(H|E).


Theorem 2.1 (Confirmation as Firmness, Schippers 2016)  Formality, Final
Probability Incrementality and Local Equivalence hold if and only if there
is a non-decreasing function $g : [0,1] \to \mathbb{R}$ such that for any
$H, E \in \mathcal{L}$ and any $p \in \mathfrak{P}$,
$c(H,E) = g(p(H \mid E))$.


On the account of confirmation as firmness, scientific hypotheses count 
as well-confirmed whenever they are sufficiently probable, that is, when 
p(H|E) exceeds a certain (possibly context-relative) threshold. This also 
corresponds to Carnap’s concept of probability, or “degree of confirma- 
tion” in his system of inductive logic (Carnap, 1950). 

All confirmation measures that satisfy the three above conditions are
ordinally equivalent, that is, they can be mapped onto each other by
means of a non-decreasing function. In particular, their confirmation
rankings agree: if there are two functions $g$ and $g'$ that satisfy
Theorem 2.1, with associated confirmation measures $c$ and $c'$, then
$c(H,E) > c(H',E')$ if and only if $c'(H,E) > c'(H',E')$. Since
confirmation as firmness is non-decreasing in $p(H \mid E)$, it follows
from the Qualitative-Quantitative Bridge Principle that $E$ confirms $H$ if
and only if $p(H \mid E) > t$ for some (possibly context-dependent)
$t \in [0,1]$.

The account of confirmation as firmness dispels some problems that
have plagued its predecessors, in particular qualitative accounts of
confirmation. Among them is the idea of hypothetico-deductive (H-D)
confirmation, where hypotheses are confirmed if they predict a phenomenon.
It is explicated by a deductive entailment relation between hypothesis
and evidence. This approach looks very natural: it aligns with Popper’s
view of scientific reasoning consisting of conjectures and refutations, and
also with William Whewell’s earlier view that


our hypotheses ought to foretel phenomena which have not yet 
been observed ...the truth and accuracy of these predictions 
were a proof that the hypothesis was valuable and, at least to a 
great extent, true. (Whewell, 1847, 62-63) 


However, H-D confirmation directly runs into the paradox of tacking by
conjunctions: If $E$ confirms $H$ (because $H \models E$), then $E$ also
confirms $H \wedge X$, for an arbitrary hypothesis $X$, even if it stems
from a completely different domain of science and is completely irrelevant
for $E$. This is clearly
too permissive since confirmation is allowed to spread in an uncontrolled 
way. The tacking paradox is therefore regarded as a major blow for the 
hypothetico-deductive approach to confirmation, notwithstanding recent 
solution attempts (Schurz, 1991; Gemes, 1993, 1998; Sprenger, 2011, 2013a). 

The Bayesian account of confirmation as firmness naturally dissolves
the tacking paradox. For any irrelevant $X$, it will be the case that
$p(H \wedge X \mid E) < p(H \mid E)$. Theorem 2.1 then tells us that there
exists a non-decreasing function $g$ that maps the conditional probability
of a hypothesis to its degree of confirmation. Hence, we can infer

$$ c(H \wedge X, E) = g(p(H \wedge X \mid E)) < g(p(H \mid E)) = c(H,E), \tag{2.2} $$

demonstrating that the conjunction is confirmed to a lower degree than the
original hypothesis $H$ (especially so for an unlikely, far-fetched
proposition $X$). Confirmation as firmness gives the intuitively correct
response to the tacking by conjunction paradox. It does not deny that
$H \wedge X$ is confirmed as well (after all, $H$ is still obviously
relevant for $E$), but the paradox is mitigated by decreasing the amount of
confirmation.


On the other hand, confirmation as firmness does not always agree 
with the use of that concept in scientific reasoning. To be sure, relative to 
the totality of observed evidence, we would call a theory well-confirmed 
if and only if it is sufficiently probable, conditional on the evidence. But 
often, scientists are interested in whether a certain experiment supports 
or corroborates a hypothesis—independent of whether we find it probable 
that the hypothesis is true. It is essential for confirmatory evidence to 
provide a good reason for believing a theory, even if the theory is, all 
things considered, unlikely. 


For instance, in the first years after Einstein invented the General The- 
ory of Relativity (GTR), many scientists did not have a particularly high 
degree of belief in GTR because of its counterintuitive nature. However, 
it was agreed upon that GTR was well-confirmed by its predictive and 
explanatory successes, such as the bending of starlight by the sun and 
the explanation of the Mercury perihelion shift (Earman, 1992). The ac- 
count of confirmation as firmness fails to capture this intuition. The same 
holds true for statistical analysis of experiments in modern science, where 
the dominant frequentist paradigm does not allow for assigning a prob- 
ability to the tested hypothesis. Instead, the confirmatory strength of the 
evidence is evaluated on the basis of whether the results are statistically 
significant and give reason to reject the tested hypothesis in favor of an
alternative. Moreover, on confirmation as firmness, E could confirm H even
if it lowers the probability of H, as long as p(H|E) is still large enough.
But few people would call an experiment where the results undermine H
a confirmation of H.


In a now classical debate in philosophy of science, Karl R. Popper 
(1954, 2002) raised these points against Carnap: degree of confirmation 
cannot be (posterior) probability. As a reaction, Carnap distinguished two 
concepts of confirmation in the second edition (1962) of “Logical
Foundations of Probability”: confirmation as firmness and confirmation as
increase in firmness. This brings us to the following natural definition
that provides a more precise condition for the relation between qualitative
confirmation judgments and probabilistic relevance:


Confirmation as Increase in Firmness  For any propositions
      $H, E \in \mathcal{L}$ with probability measure $p \in \mathfrak{P}$,

        1. Evidence $E$ confirms/supports hypothesis $H$ if and only if
           $p(H \mid E) > p(H)$.

        2. Evidence $E$ disconfirms/undermines hypothesis $H$ if and only
           if $p(H \mid E) < p(H)$.

        3. Evidence $E$ is neutral with respect to $H$ if and only if
           $p(H \mid E) = p(H)$.


In other words, E confirms H if and only if E raises our degree of belief 
in H. Such explications of confirmation are also called statistical relevance 
accounts of confirmation because the neutral point is determined by the 
statistical independence of H and E. They measure the evidential sup- 
port that H receives from E. The increase in firmness explication of confir- 
mation receives empirical support from findings by Tentori et al. (2007a): 
ordinary people use the concept of confirmation in a way which can be 
dissociated from posterior probability and that is strongly correlated with 
measures of evidential support. In the remaining variations, we will stan- 
dardly use increase in firmness, or evidential support, when modeling the 
confirmation of scientific hypotheses and theories. 

Confirmation as increase in firmness has interesting relations to qual- 
itative accounts of confirmation and the paradoxes we have encountered. 
For instance, hypothetico-deductive confirmation emerges as a special 
case: if H entails E and p(E) < 1, then p(E|H) = 1 and by Bayes’ The- 
orem, p(H|E) > p(H). We will also show that confirmation as increase 
in firmness can address the tacking by conjunction paradox. But first, we 
will demonstrate how confirmation as increase in firmness handles the 
longstanding paradox of the ravens. 
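
The step from entailment to probability raising can be spelled out as follows. This is a short worked derivation rather than a new result; it additionally assumes $p(H) > 0$, which the text leaves implicit.

```latex
\begin{align*}
p(H \mid E) &= \frac{p(E \mid H)\, p(H)}{p(E)}
  && \text{(Bayes' Theorem)} \\
            &= \frac{p(H)}{p(E)}
  && \text{(since } H \models E \text{ gives } p(E \mid H) = 1\text{)} \\
            &> p(H)
  && \text{(since } p(E) < 1 \text{ and } p(H) > 0\text{)}
\end{align*}
```

The boost is larger the smaller $p(E)$ is, so surprising predictions that come true confirm more strongly than expected ones.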

Let $H = \forall x\colon Rx \to Bx$ stand for the hypothesis that all
ravens are black. $H$ is equivalent to the hypothesis
$H' = \forall x\colon \neg Bx \to \neg Rx$ that no non-black object is a
raven. It is highly intuitive that logically equivalent hypotheses are
confirmed or disconfirmed to the same degree; nothing in a formal theory of
confirmation should depend on the particular formulation of the hypothesis.
Hence, anything that confirms $H$ also confirms $H'$ and vice versa (Nicod,
1961). It is also intuitive that universal conditionals such as “all ravens
are black” are confirmed by their instances, i.e., black ravens. However,
as Hempel (1945a, 1945b) observed, the conjunction of both principles leads
to paradoxical results. A black raven is an instance of $H$ and confirms
the raven hypothesis. A white shoe is an instance of $H'$ and confirms the
hypothesis that non-black objects are not ravens. But because of the
aforementioned equivalence condition, the white shoe also confirms the
hypothesis that all ravens are black! This result is known as the paradox
of the ravens, or alternatively, as Hempel’s paradox.


                        $W_1$          $W_2$
  Black ravens          100            1,000
  Non-black ravens      0              1
  Other birds           1,000,000      1,000,000

Table 2.2: I.J. Good’s (1967) counterexample to the paradox of the ravens.

The account of confirmation as increase in firmness allows us to spot what
is wrong with instance confirmation and thereby resolves the paradox.
While the intuition that instances confirm is certainly valid for some
background assumptions, it is not valid for all possible situations. I.J.
Good (1967) constructed a simple counterexample: Assume that there are only
two possible worlds, $W_1$ and $W_2$, whose properties are described by
Table 2.2.

In this scenario, the raven hypothesis $H$ is true whenever $W_1$ is the
case, and false whenever $W_2$ is the case. Moreover, the observation of a
black raven is evidence that $W_2$ is the case and therefore evidence that
not all ravens are black:

$$ p(Ra.Ba \mid W_1) = \frac{100}{1{,}000{,}100} < \frac{1{,}000}{1{,}001{,}001} = p(Ra.Ba \mid W_2). $$


By an application of Bayes’ Theorem, we infer $p(W_1 \mid Ra.Ba) < p(W_1)$
and $p(H \mid Ra.Ba) < p(H)$. This shows that universal conditionals are not always
confirmed by their instances. We see how the explication of confirmation 
as increase in firmness corrects our pre-theoretic intuitions regarding the 
theory-evidence relation. The raven paradox may thus be resolved by 
rejecting one of its assumptions, namely the idea that instances always 
confirm a hypothesis. 
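
Good's counterexample can be checked numerically. The following Python sketch is an illustration under stated assumptions: the object counts come from Table 2.2, while the prior of 1/2 for each world and the function names are hypothetical choices that the text does not specify.

```python
from fractions import Fraction

# Object counts per world, taken from Table 2.2.
WORLDS = {
    "W1": {"black_ravens": 100, "nonblack_ravens": 0, "other_birds": 1_000_000},
    "W2": {"black_ravens": 1_000, "nonblack_ravens": 1, "other_birds": 1_000_000},
}


def likelihood_black_raven(world):
    """Probability that a randomly sampled object is a black raven."""
    counts = WORLDS[world]
    return Fraction(counts["black_ravens"], sum(counts.values()))


def posterior_w1(prior_w1):
    """p(W1 | black raven observed) by Bayes' Theorem; W1 and W2 are exhaustive."""
    numerator = likelihood_black_raven("W1") * prior_w1
    denominator = numerator + likelihood_black_raven("W2") * (1 - prior_w1)
    return numerator / denominator


prior = Fraction(1, 2)  # assumed prior; any value strictly between 0 and 1 works
posterior = posterior_w1(prior)
# H is true exactly in W1, so p(H | Ra.Ba) equals the posterior of W1.
print(float(prior), float(posterior))
```

Because the likelihood of a black raven is lower in W1 than in W2, the posterior of W1, and hence of the raven hypothesis, falls below the prior for every non-extreme prior.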

The raven paradox is threefold, however: apart from the (qualitative) 
question whether or not the observation of a white shoe confirms the raven 
hypothesis, there is also the comparative paradox: does the observation 
of a black raven confirm the raven hypothesis to a higher degree than the 
observation of a white shoe? Fitelson and Hawthorne (2011) show in their
Theorem 2 that this is indeed the case if plausible assumptions on the
real world are made: $p(H \mid Ra.Ba) > p(H \mid \neg Ra.\neg Ba)$. By
Final Probability Incrementality, this implies that $Ra.Ba$ confirms $H$
more than $\neg Ra.\neg Ba$ does. This shows, ultimately, why we consider a
black raven to be more
important evidence for the raven hypothesis than a white shoe. In the light 
of these results, the paradox loses its bite. 

Confirmation as increase in firmness also addresses another notorious 
paradox, Nelson Goodman’s (1955) new riddle of induction. The name 
notwithstanding, it is not meant as a general charge on inductive reason- 
ing, but on a particularly plausible view of inductive inference: namely 
that one and the same evidence cannot confirm two hypotheses whose 
predictions contradict each other. 

As Goodman shows, this principle disagrees with what most people 
would classify as a justified inductive inference. Consider, for example, 
the following case: 


Observation: emerald $e_1$ is green. 
Observation: emerald $e_2$ is green. 


Generalization: All emeralds are green. 


This seems to be a perfect example of a valid inductive inference. Now 
define the predicate “grue”, which applies to all green objects if they were 
observed for the first time prior to time t = “now”, and to all blue objects 
if they are observed later. (This is just a description of the extension of the 
predicate—no object is supposed to change color.) The following inductive 
inference satisfies the same logical scheme as the previous one: 


Observation: emerald $e_1$ is grue. 
Observation: emerald $e_2$ is grue. 


Generalization: All emeralds are grue. 


In spite of the gerrymandered nature of the “grue” predicate, the infer- 
ence is sound: it satisfies the basic scheme of enumerative induction, and 
the premises are undoubtedly true. But then, it is paradoxical that two 
valid inductive inferences support flatly opposite conclusions. The first 
generalization predicts that emeralds observed in the future are green, the 
second generalization predicts them to be blue. How do we escape from 
this dilemma? 


One may propose that in virtue of its gerrymandered nature, the 
predicate “grue” should not enter inductive inferences. Goodman notes, 
however, that it is perfectly possible to re-define the standard predicates 
“green” and “blue” in terms of “grue” and its conjugate predicate “bleen” 
(=blue if observed prior to t, else green). Hence, any preference for the 
“natural” predicates and the “natural” inductive inference seems to be
arbitrary, or at least conditional on the choice of a specific language. So
another move is required.


The explication of confirmation as increase in firmness immediately 
comes up with an answer: both hypotheses (the “green” and the “grue” 
hypothesis) should count as confirmed. We need to abandon the idea that 
evidence cannot confirm incompatible hypotheses. If they share content 
that is confirmed by the current experiment, they are both supported by it. 
For example, Einstein’s work on the photoelectric effect raised our degree 
of belief in the hypothesis that electromagnetic radiation can be divided 
into a finite number of quanta, and thereby also in different versions of 
quantum theory—e.g., those that were compatible with relativity theory 
and those that weren’t. 


How do confirmation judgments inform our predictions for future
observations? Generalizing Goodman’s green/grue argument, it seems that
any prediction for the color of the next observed emerald is equally
reasonable. However, from a Bayesian point of view, this is only true if
all hypotheses (that is, the green, grue, gred, etc. hypotheses) are equally
probable at the time of observation. In practice, this will usually not be
the case: some hypotheses have higher plausibility than others. Prior
probabilities act as a counterweight to Goodman’s paradox and guide our
predictions when the observations cannot distinguish between two hypotheses.
And of course, the grue hypothesis is much less plausible than the
green hypothesis. Note that these prior probabilities cannot themselves be
based on Bayesian reasoning: they have to come from theoretical principles,
past track record, coherence with other parts of science, and the vague
reasoning faculty that we call scientific judgment. Bayesian confirmation
theory explains how to amalgamate prior degree of belief with observed
evidence, but it does not tell you which prior degrees of belief are
reasonable. In this sense, Goodman shows a general problem for formal
reasoning about confirmation and evidence: there is no viable
premise-independent theory of inductive support (see also Norton, 2016).
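
The role of the priors can be made explicit in a short derivation. The symbols are introduced here for illustration only: $G$ stands for “all emeralds are green”, $G'$ for “all emeralds are grue”, and $E$ for the observation that the emeralds examined before $t$ are green. Both hypotheses entail $E$, and we assume $0 < p(E) < 1$ and positive priors.

```latex
\begin{align*}
p(G \mid E)  &= \frac{p(E \mid G)\, p(G)}{p(E)}   = \frac{p(G)}{p(E)}  > p(G), \\
p(G' \mid E) &= \frac{p(E \mid G')\, p(G')}{p(E)} = \frac{p(G')}{p(E)} > p(G'), \\
\frac{p(G \mid E)}{p(G' \mid E)} &= \frac{p(G)}{p(G')}.
\end{align*}
```

Both hypotheses are confirmed in the sense of increase in firmness, but the evidence leaves their probability ratio untouched, so any preference for the green hypothesis after observing $E$ is inherited entirely from the priors.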

The three showcases above—the tacking by conjunction paradox, the 
paradox of the ravens and Goodman’s new riddle of induction—make 
clear that Bayesian Confirmation Theory can successfully address long- 
standing puzzles in inductive reasoning. However, there is one question 
we have evaded so far, and now we shall turn to it: how can we measure
confirmation as increase in firmness, or alternatively, how should
evidential support be quantified?


The Plurality of Bayesian Confirmation Measures 


For scientists who want to report the results of their experiments, quanti- 
fying the strength of the observed evidence is an urgent and challenging 
question. It is also crucial for giving a Bayesian answer to the Duhem- 
Quine problem (Duhem, 1914). If an experiment fails and we ask our- 
selves which hypothesis to reject, the degree of (dis)confirmation of the 
involved hypotheses can be used to evaluate their standing. Unlike purely 
qualitative accounts of confirmation, a measure of evidential support can 
indicate which hypothesis we should discard. For this reason, the search 
for a proper confirmation measure is more than a technical exercise: it 
is of vital importance for distributing praise and blame between different
hypotheses that bear on an observation. Such assessments may also be
sensitive to the measure used, highlighting the need for characterizing
the mathematical properties of the measures and comparing them on a
normative basis (Eells and Fitelson, 2000, 2002; Fitelson, 1999, 2001a,b).

Table 2.3 gives a survey of measures that are frequently discussed 
in the literature. We have normalized them such that for each measure 
c(H,E), confirmation amounts to c(H,E) > 0, neutrality to c(H,E) = 0 
and disconfirmation to c(H,E) < 0. This allows for a better comparison of 
the measures and their properties. 

  Difference Measure               $d(H,E) = p(H \mid E) - p(H)$
  Log-Ratio Measure                $r(H,E) = \log \frac{p(H \mid E)}{p(H)}$
  Log-Likelihood Measure           $l(H,E) = \log \frac{p(E \mid H)}{p(E \mid \neg H)}$
  Kemeny-Oppenheim Measure         $k(H,E) = \frac{p(E \mid H) - p(E \mid \neg H)}{p(E \mid H) + p(E \mid \neg H)}$
  Generalized Entailment Measure   $z(H,E) = \begin{cases} \frac{p(H \mid E) - p(H)}{1 - p(H)} & \text{if } p(H \mid E) \geq p(H) \\ \frac{p(H \mid E) - p(H)}{p(H)} & \text{if } p(H \mid E) < p(H) \end{cases}$
  Christensen-Joyce Measure        $s(H,E) = p(H \mid E) - p(H \mid \neg E)$
  Carnap’s Relevance Measure       $c'(H,E) = p(H \wedge E) - p(H)\,p(E)$
  Rips Measure                     $r'(H,E) = 1 - \frac{p(\neg H \mid E)}{p(\neg H)}$

Table 2.3: A list of popular measures of evidential support.


Evidently, these measures all have quite distinct logical and
epistemological properties. It thus makes sense to apply the methodology
that we used for confirmation as firmness, and to characterize them in terms
of representation theorems where, as before, Formality and Final
Probability Incrementality will serve as minimal reasonable constraints on
any measure of evidential support. Notably, Final Probability
Incrementality already rules out two of the measures in the list, namely
Carnap’s relevance measure $c'(H,E) = p(H \wedge E) - p(H)\,p(E)$ and the
Christensen-Joyce measure $s(H,E) = p(H \mid E) - p(H \mid \neg E)$
(Christensen, 1999). Carnap’s relevance measure is also problematic because
it satisfies the symmetry $c(H,E) = c(E,H)$; in other words, $E$ confirms
$H$ as much as $H$ confirms $E$. Many intuitive confirmation judgments
violate this equality. For example, knowing that a specific card in the
deck is the ace of spades confirms the hypothesis that this card is a spade
much more strongly than the other way round. The same problem affects the
(log-)ratio measure $r(H,E)$ (Eells and Fitelson, 2002).
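
The measures of Table 2.3 and the card example can be sketched in a few lines of Python. This is an illustrative sketch, not an implementation from the literature: the function names are assumptions, each function takes p(H), p(E) and p(H ∧ E) as inputs in the spirit of Formality, and non-extreme probabilities are assumed wherever a denominator occurs.

```python
import math
from fractions import Fraction


def difference(p_h, p_e, p_he):
    """d(H,E) = p(H|E) - p(H)."""
    return p_he / p_e - p_h


def log_ratio(p_h, p_e, p_he):
    """r(H,E) = log(p(H|E) / p(H)) = log(p(H and E) / (p(H) p(E)))."""
    return math.log(float(p_he / (p_h * p_e)))


def log_likelihood(p_h, p_e, p_he):
    """l(H,E) = log(p(E|H) / p(E|not H)); requires p(E|not H) > 0."""
    return math.log(float((p_he / p_h) / ((p_e - p_he) / (1 - p_h))))


def kemeny_oppenheim(p_h, p_e, p_he):
    """k(H,E) = (p(E|H) - p(E|not H)) / (p(E|H) + p(E|not H))."""
    lik_h, lik_not_h = p_he / p_h, (p_e - p_he) / (1 - p_h)
    return (lik_h - lik_not_h) / (lik_h + lik_not_h)


def generalized_entailment(p_h, p_e, p_he):
    """z(H,E): difference normalized by 1 - p(H) (confirmation) or p(H)."""
    d = difference(p_h, p_e, p_he)
    return d / (1 - p_h) if d >= 0 else d / p_h


def christensen_joyce(p_h, p_e, p_he):
    """s(H,E) = p(H|E) - p(H|not E)."""
    return p_he / p_e - (p_h - p_he) / (1 - p_e)


def carnap_relevance(p_h, p_e, p_he):
    """c'(H,E) = p(H and E) - p(H) p(E)."""
    return p_he - p_h * p_e


def rips(p_h, p_e, p_he):
    """r'(H,E) = 1 - p(not H|E) / p(not H)."""
    return 1 - (1 - p_he / p_e) / (1 - p_h)


# Card example: one card drawn from a standard 52-card deck.
spade, ace_of_spades = Fraction(13, 52), Fraction(1, 52)
ace_confirms_spade = dict(p_h=spade, p_e=ace_of_spades, p_he=ace_of_spades)
spade_confirms_ace = dict(p_h=ace_of_spades, p_e=spade, p_he=ace_of_spades)

for measure in (difference, generalized_entailment, carnap_relevance, log_ratio):
    print(measure.__name__,
          measure(**ace_confirms_spade), measure(**spade_confirms_ace))
```

In this example the difference measure and the generalized entailment measure rate the support that “ace of spades” lends to “spade” far higher than the converse, whereas Carnap's relevance measure and the log-ratio measure return the same value in both directions. The log-likelihood measure is left out of the loop because p(E|¬H) = 0 in the first direction, where it is unbounded.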

We will not discuss all representation results, but we present, pars pro 
toto, four specific conditions that figure in different representation theo- 
rems and that resurface at later points of the book, too. The first condition 
is the 


Law of Likelihood  $c(H,E) > c(H',E)$ if and only if
      $p(E \mid H) > p(E \mid H')$.


This condition has a long history of discussion in philosophy and statistics. 
The idea is that E favors H over H’ if and only if the likelihood of H on E is 
greater than the likelihood of H’ on E (Hacking, 1965; Edwards, 1972; Roy- 
all, 1997; Sober, 2008). In other words, E is more expected under H than 
under H’. Law of Likelihood is also at the basis of the likelihoodist theory 
of confirmation, which regards confirmation as a comparative relation
between two competing hypotheses and refuses to make a straightforward 
judgment on how much E confirms H. 

The second condition demands that conditioning on $E'$ does not affect
the confirmation relation between $H$ and $E$, as long as $E'$ is
sufficiently independent from $E$ and $H$:


Modularity  If $E \perp E' \mid \pm H$ (that is,
      $p(E \mid \pm H \wedge E') = p(E \mid \pm H)$), then
      $c(H,E) = c_{|E'}(H,E)$, where $c_{|E'}$ denotes confirmation relative
      to the probability distribution conditional on $E'$.


That is, if $E'$ does not affect the likelihoods that $H$ and $\neg H$ have
on $E$, then conditioning on $E'$ (now supposedly irrelevant evidence) does
not alter the degree of confirmation (Heckerman, 1988; Crupi et al., 2013).
The intuition behind Modularity is that probabilistically irrelevant
information should not alter a judgment of degree of confirmation.

A third condition concerns the question of how the confirmation of
hypothesis $H$ by evidence $E$ relates to the confirmation of the
disjunction $H \vee H'$ by the same evidence. The idea is that the logical
weakening of a hypothesis contributes to the confirmation of the compound
if and only if the added disjunct is confirmed by the evidence.


Disjunction of Alternative Hypotheses  Assume that $H$ and $H'$ are
      inconsistent with each other. Then, $c(H \vee H', E) > c(H,E)$ if and
      only if $E$ confirms $H'$ as well (that is, $p(H' \mid E) > p(H')$).
      Analogous conditions hold for $c(H \vee H', E) = c(H,E)$ and
      $c(H \vee H', E) < c(H,E)$.
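
As a quick check of the idea that a weakening helps exactly when the added disjunct is itself confirmed, consider the difference measure $d(H,E) = p(H \mid E) - p(H)$. The following derivation is a worked illustration; it only uses additivity of probability for the inconsistent hypotheses $H$ and $H'$.

```latex
\begin{align*}
d(H \vee H', E) &= p(H \vee H' \mid E) - p(H \vee H') \\
                &= \bigl[p(H \mid E) + p(H' \mid E)\bigr] - \bigl[p(H) + p(H')\bigr] \\
                &= d(H,E) + d(H',E).
\end{align*}
```

Hence $d(H \vee H', E)$ exceeds $d(H,E)$ if and only if $d(H',E) > 0$, that is, if and only if $p(H' \mid E) > p(H')$, and the cases of equality and of a lower value follow in the same way.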


Finally, the fourth condition is inspired by the analogy between de- 
ductive and inductive logic: confirmation is viewed as a generalization 
of logical entailment to uncertain reasoning (Crupi et al., 2007; Crupi and 
Tentori, 2013). Degree of confirmation should therefore display the
symmetry that contraposition expresses for logical entailment: if $E$
confirms $H$, then $\neg H$ also confirms $\neg E$, and the two degrees of
confirmation are the same. Similarly, disconfirmation is, like logical
inconsistency, modeled as a symmetrical relation.


Contraposition/Commutativity  If $E$ confirms $H$, then
      $c(H,E) = c(\neg E, \neg H)$; and if $E$ disconfirms $H$, then
      $c(H,E) = c(E,H)$.


Combined with Formality and Final Probability Incrementality, each of
these four principles gives rise to a representation theorem that singles out
a particular measure (for the theorems and proofs, see Heckerman, 1988;
Crupi et al., 2013; Crupi, 2013):


Theorem 2.2 (Representation Theorems for Confirmation Measures)

   1. If Formality, Final Probability Incrementality and Law of Likelihood
      hold, then there is a non-decreasing function $g$ such that
      $c(H,E) = g(r(H,E))$.

   2. If Formality, Final Probability Incrementality and Modularity hold,
      then there are non-decreasing functions $g$ and $g'$ such that
      $c(H,E) = g(l(H,E))$ and $c(H,E) = g'(k(H,E))$. Note that $k$ and $l$
      are ordinally equivalent.

   3. If Formality, Final Probability Incrementality and Disjunction of
      Alternative Hypotheses hold, then there is a non-decreasing function
      $g$ such that $c(H,E) = g(d(H,E))$.

   4. If Formality, Final Probability Incrementality and Commutativity hold,
      then there is a non-decreasing function $g$ such that
      $c(H,E) = g(z(H,E))$.


It should also be noted that the Bayes factor, a popular measure of evi- 
dential support in Bayesian statistics (Kass and Raftery, 1995; Goodman, 
1999b), falls under the scope of the theorem. For mutually exclusive
hypotheses $H_0$ and $H_1$, and evidence $E$, the Bayes factor in favor of
$H_0$ is defined as

$$ B_{01}(E) := \frac{p(H_0 \mid E)}{p(H_1 \mid E)} \cdot \frac{p(H_1)}{p(H_0)} = \frac{p(E \mid H_0)}{p(E \mid H_1)}. $$

It is not difficult to see that this quantity is ordinally equivalent to the
log-likelihood measure $l$ and the Kemeny-Oppenheim measure $k$ (Kemeny
and Oppenheim, 1952) when $H_0$ and $H_1$ exhaust the space of hypotheses:
just substitute $H$ and $\neg H$ for $H_0$ and $H_1$.
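
The ordinal equivalence can be verified in two lines. Writing $B$ for the Bayes factor $p(E \mid H) / p(E \mid \neg H)$ of $H$ against $\neg H$ (a shorthand introduced here for illustration, with $0 < B < \infty$ assumed), dividing numerator and denominator of $k$ by $p(E \mid \neg H)$ gives:

```latex
\begin{align*}
l(H,E) &= \log \frac{p(E \mid H)}{p(E \mid \neg H)} = \log B, \\
k(H,E) &= \frac{p(E \mid H) - p(E \mid \neg H)}{p(E \mid H) + p(E \mid \neg H)}
        = \frac{B - 1}{B + 1}.
\end{align*}
```

Both $\log B$ and $(B-1)/(B+1)$ are strictly increasing in $B$, so the three quantities always rank pairs of hypothesis and evidence in the same order. They differ only in scale: $B$ has its neutral point at 1, whereas $l$ and $k$ have theirs at 0.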

To underline that the difference between the various confirmation measures
has substantial philosophical ramifications, let us go back to the
problem of irrelevant conjunctions. If we analyze this problem in terms of
the (log-)ratio measure $r$, then we obtain, assuming $H \models E$, that
for an “irrelevant” conjunct $H'$,

$$ \begin{aligned}
r(H \wedge H', E) &= \log \frac{p(H \wedge H' \mid E)}{p(H \wedge H')} = \log \frac{p(E \mid H \wedge H')}{p(E)} \\
                  &= \log \frac{1}{p(E)} = \log \frac{p(E \mid H)}{p(E)} \\
                  &= r(H,E),
\end{aligned} $$

such that the irrelevant conjunction is supported to the same degree as
the original hypothesis. This consequence is certainly unacceptable as a
judgment of evidential support since $H'$ could literally be any hypothesis
unrelated to the evidence, e.g., “the star Sirius is a giant light bulb”. In
addition, the result does not only hold for the special case of deductive
entailment: it holds whenever the likelihoods of $H$ and $H \wedge H'$ on
$E$ are the same, that is, $p(E \mid H \wedge H') = p(E \mid H)$.


The other measures fare better in this respect: whenever
$p(E \mid H \wedge H') = p(E \mid H)$, all other measures in Theorem 2.2
reach the conclusion that $c(H \wedge H', E) < c(H,E)$ (Hawthorne and
Fitelson, 2004). In this way, we
can see how Bayesian Confirmation Theory improves on H-D confirma- 
tion and other qualitative accounts of confirmation: the paradox is ac- 
knowledged, but at the same time, it is demonstrated how the paradox 
can be mitigated. 
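
For the difference measure $d(H,E) = p(H \mid E) - p(H)$, the mitigation can be made quantitative. The derivation below is a worked illustration under the assumption that $p(E \mid H \wedge H') = p(E \mid H)$, with all conditional probabilities well defined.

```latex
\begin{align*}
d(H \wedge H', E) &= p(H \wedge H') \left[ \frac{p(E \mid H \wedge H')}{p(E)} - 1 \right] \\
                  &= p(H' \mid H)\, p(H) \left[ \frac{p(E \mid H)}{p(E)} - 1 \right] \\
                  &= p(H' \mid H)\, d(H,E).
\end{align*}
```

So whenever $E$ confirms $H$ and $p(H' \mid H) < 1$, the conjunction receives strictly less support than $H$ alone, and the less plausible the tacked-on conjunct is given $H$, the smaller the share of support that carries over.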


Discussion 


Which of the remaining confirmation measures should be preferred? This 
is a difficult question that probably cannot be resolved on purely theo- 
retical grounds. The adequacy conditions in the representation theorems 
have quite divergent motivations, and a straightforward comparison is un- 
likely to lead to conclusive results. For example, it has been shown that 
no confirmation measure satisfies the following two conditions: (i) degree 
of confirmation is maximal if E implies H; (ii) the a priori informativity 
(cashed out in terms of predictive content and improbability) of a
hypothesis contributes to degree of confirmation (Brössel, 2013, 389-390). See
also Huber (2005). Both conditions are intuitively plausible: (i) captures 
the idea that degree of confirmation generalizes logical entailment, (ii) the 
idea that hypotheses with informative predictions should be rewarded. 
But we have to choose, and our choice will depend on what we value in 
scientific reasoning. 

The idea that there is “the one true measure of confirmation” (Milne, 
1996) is therefore problematic. We may abandon such a confirmational 
monism in favor of pluralism (Fitelson, 1999, 2001b): we accept that there 
are different senses of degree of confirmation that correspond to different 
explications. For example, d strikes us as a natural explication of increase
in subjective confidence, z generalizes deductive entailment, and l and k
measure the discriminatory force of the evidence regarding H and ¬H.


Ultimately, the choice between the measures may also depend on empirical
findings. Crupi et al. (2007) and Tentori et al. (2007a) compare
different confirmation measures in an experiment where white and black
balls are drawn from an urn and the participants must assess the
confirmation of different hypotheses about the composition of balls in the urn.
Their results favor the z-measure, followed by the l-measure, whereas the
difference measure d is at the bottom of the list. In a similar vein, a recent
experiment by Colombo et al. (2016a) has pointed out that judgments of
confirmation are enhanced by the prior plausibility of a hypothesis if the
probabilistic relevance relations are held constant. This phenomenon,
consistent with the findings of Crupi et al. (2007), is also called the Matthew
effect: “For unto every one that hath shall be given, and he shall have
abundance: but from him that hath not shall be taken even that which
he hath.” (Matthew 25:29). For Bayesian confirmation measures, this
means that measures which do not assign a ceteris paribus bonus to
logically stronger and more informative hypotheses are probably more in
line with our empirical confirmation judgments (Festa, 2012; Roche, 2014).
If there is hope for confirmational monism, it might come from empirical
research on confirmation judgments, showing that participants share the
motivation behind a specific measure.


We have seen that Bayesian Confirmation Theory yields many inter- 
esting results in philosophy of science. But it is also a research paradigm 
that connects well to scientific disciplines: Bayesian reasoning has sparked 
interest among experimental psychologists and connects well to cutting- 
edge research on human cognition (e.g., Oaksford and Chater, 2000; Doya 
et al., 2007; Douven, 2016). There is a large number of interdisciplinary 
papers on probabilistic reasoning, where both cognitive scientists and 
philosophers have been involved (e.g., Tentori et al., 2007b; Crupi et al., 
2008; Zhao et al., 2012). But also on the theoretical side, there is ample 
room for future research. Questions that are just about to be explored in- 
clude an analysis of confirmation measures in information-theoretic terms 
(Crupi and Tentori, 2014) and the use of confirmation measures for ana- 
lyzing the diagnostic value of a medical test (Crupi et al., 2009). Especially 
the latter question, which deals with designing medical tests that lead to 
a high amount of (dis)confirmation upon revealing the results, strikes us 
as an exciting combination of Bayesian philosophy of science with clinical 
practice. 


In spite of all these cross-disciplinary connections, one criticism of
Bayesian Confirmation Theory has been levelled again and again: that
it misrepresents actual scientific reasoning. As evidence for this claim, the
Problem of Old Evidence is often cited (Glymour, 1980; Brössel and Huber,
2015). The next variation responds to these worries.

Variation 3: The Problem of Old Evidence


The Problem of Old Evidence—modeling how already known evidence 
confirms a scientific theory—is one of the most troubling and persistent 
challenges for Bayesian Confirmation Theory. The most famous case of 
confirmation by old evidence might be the Mercury perihelion shift (Gly- 
mour, 1980; Earman, 1992). For a long time, this phenomenon could not 
be fully explained by Newtonian mechanics or any other reputable phys- 
ical theory. Then, Einstein realized that his General Theory of Relativity 
(GTR) accounted for the perihelion shift. According to most physicists, 
this discovery conferred a substantial degree of confirmation on GTR, per- 
haps even more than some pieces of novel evidence, such as Eddington’s 
1919 solar eclipse observations. Also in other scientific disciplines, newly 
introduced theories are commonly assessed with respect to their success 
at explaining away observational anomalies. Think, for example, of the
assessment of global climate models against a track record of historical data,
or of economic theories that try to explain anomalies in decision-making
under uncertainty, such as the Allais or Ellsberg paradoxes.


We can extract a general scheme from these examples. A phenomenon 
E is unexplained by the available scientific theories. At some point, it 
is discovered that theory T accounts for E. E is old evidence: at the time
when this relationship is discovered, the scientist is already certain or close
to certain that the phenomenon E is real. Indeed, in the GTR example,
astronomers had been collecting data on the Mercury perihelion shift for many
decades. Nevertheless, E apparently confirms T because T resolves a
well-known and persistent observational anomaly. How does this fit into the
Bayesian view of confirmation?


The relevant sense of confirmation in this example is not firmness, but
increase in firmness: confirming evidence raises a rational agent’s
confidence in the theory. E confirms T if and only if p(T|E) > p(T). These two
probabilities are related by means of Bayes’ Theorem:

p(T|E) = p(T) · p(E|T) / p(E)

When E is old evidence and already known to the scientist, her degree
of belief in E is maximal: p(E) = 1. With that assumption, it follows that
the probability of T conditional on E cannot be greater than the
unconditional probability:


p(T|E) = p(T) · p(E|T) / p(E) = p(T) · p(E|T) ≤ p(T)    (3.1)


Hence, E does not confirm T in the sense of increasing the firmness of
our belief in T. The very idea of confirmation by old evidence, or
equivalently, confirmation by accounting for well-known observational
anomalies, seems impossible to describe within the Bayesian belief kinematics.
This is the Problem of Old Evidence (POE). The problem may also be phrased
differently, as exposing that the Bayesian cannot account for how the
discovery of explanatory relations between theory and evidence increases
the epistemic standing of the theory. Notably, the problem does not
allow for an easy fix by making assumptions on p(T) and the likelihoods
p(E|±T): as long as 0 < p(T) < 1, the law of total probability implies
that p(E|T) = p(E|¬T) = 1 if p(E) = 1. Hence, T fails to be confirmed by
E.
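
The argument can be traced with a small computation. This is only a sketch with hypothetical numbers (a prior of 0.3 for T and illustrative likelihoods); the function name is our own. It contrasts novel evidence, where p(E) < 1, with old evidence, where both likelihoods are forced to equal one.

```python
def update_on_E(p_T, p_E_given_T, p_E_given_notT):
    """Return (p(E), p(T|E)) via the law of total probability and Bayes' Theorem."""
    p_E = p_E_given_T * p_T + p_E_given_notT * (1 - p_T)
    return p_E, p_T * p_E_given_T / p_E


prior = 0.3  # hypothetical prior degree of belief in T

# Novel evidence: E is uncertain beforehand, so learning E can confirm T.
p_E_new, post_new = update_on_E(prior, 1.0, 0.2)

# Old evidence: p(E) = 1 forces p(E|T) = p(E|not-T) = 1 when 0 < p(T) < 1.
p_E_old, post_old = update_on_E(prior, 1.0, 1.0)

print(f"novel evidence: p(E) = {p_E_new:.2f}, p(T|E) = {post_new:.2f}")
print(f"old evidence:   p(E) = {p_E_old:.2f}, p(T|E) = {post_old:.2f}")
```

In the second case the posterior coincides with the prior, which is exactly the situation described by the Problem of Old Evidence.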

The POE has different aspects, as worked out by Ellery Eells (1985, 
1990). First, there is the static (Eells: “ahistorical”) POE: belief changes 
induced by discovering T or an explanatory relationship between T and E 
have already taken place. Still we would like to say that E is evidentially 
relevant for T: when faced with a decision between T and a competitor T’, 
E is a good reason for preferring T over T’. But there is also the dynamic 
(Eells: “historical”) POE: it refers to the moment in time where T and its 
relation to E are discovered. Why does the discovery that T accounts for E 
raise our confidence in T? How can the discovery of an explanatory success 
be confirmationally relevant? (see also Romeijn and Wenmackers, 2016). 
In other words, the dynamic POE deals with the question of confirmation 
relative to actual degrees of belief, which is not necessarily the case for the 
static POE. 

This variation develops new solution proposals for both the static and
the dynamic POE. Section 3.1 comments on solutions of the dynamic POE
proposed by Garber (1983), Jeffrey (1983), Niiniluoto (1983) and Earman
(1992). On these accounts, confirmation occurs through conditionalizing 
on the proposition that T implies E. Section 3.2 presents an improvement 
on this approach. Section 3.3 analyses the static POE while Section 3.4 
explains our solution of that problem. We conclude with a brief discussion 
in Section 3.5 and by giving the proofs of our results in Section 3.6. 


The Dynamic Problem of Old Evidence: 
The GJN Approach 


The dynamic Problem of Old Evidence is concerned with how learning a 
deductive or an explanatory consequence of a theory can raise our confi- 
dence in that theory. An example from history may help to get this clear. 
In classical examples, such as explaining the Mercury perihelion shift, the 
newly invented theory (here: GTR) was initially not known to entail the 
old evidence. It took Einstein some time to find out that T entailed E 
(Brush, 1989; Earman, 1992). By learning the relationship X := T ⊢ E,
Einstein increased his confidence in T since such a strong consilience of theory
and data could not be expected beforehand. Thus, the inequality


p(T|X,E) > p(T|E)    (3.2)


seems to be a plausible representation of Einstein’s degrees of belief before 
and after making the discovery that GTR explained the perihelion shift of 
Mercury. Consequently, the relevant piece of evidence is not E itself, but 
the learning of a specific relation between theory and evidence, namely 
that T implies, accounts for or explains E. 

However, such belief change is hard to model in a Bayesian framework. 
A Bayesian reasoner is assumed to be logically omniscient and the logical 
fact X := T ⊢ E should always have been known. Hence, X cannot be
properly learned by a Bayesian: it is, and has always been, part of her 
background beliefs. 

To solve this problem, several philosophers have relaxed the assump- 
tion of logical omniscience and enriched the set of propositions about 
which agents have degrees of belief. New atomic sentences of the form
X := T ⊢ E are added (Garber, 1983; Jeffrey, 1983; Niiniluoto, 1983), such
that Bayesian Confirmation Theory can account for our cognitive
limitations in deductive reasoning. Then, it can be shown that under suitable
assumptions, conditionalizing on X confirms T.

The first models along these lines have been developed by Daniel Garber,
Richard Jeffrey and Ilkka Niiniluoto in a group of papers which all
appeared in 1983. Henceforth, we will refer to the family of their solu- 
tion proposals as the GJN solutions. In order to properly compare our 
own solution proposals to the state-of-the-art, and to assess their innova- 
tive value, we will briefly recap the achievements of the GJN models and 
elaborate their limitations and weaknesses. 

All GJN models aim to show that conditionalizing on the proposition 
X increases the posterior probability of T. Eells (1990, 211) distinguishes 
three steps in this endeavor: First, parting with the logical omniscience 
assumption and developing a formal framework for imperfect Bayesian 
reasoning. Second, describing which kind of relation obtains between T 
and E. Third, showing that learning this relation increases the probability 
of T. While the GJN models neglect the second step, probably in due an- 
ticipation of the diversity of logical and explanatory relations in science, 
they are quite explicit on Step 1 and Step 3. 

Garber’s model focuses on Step 1 and on learning logical truths and ex- 
planatory relations in a Bayesian framework (Garber, 1983). For instance, 
learning logical/mathematical truths can be quite insightful and lead to 
great progress in science. The famous, incredibly complex proof of
Fermat’s Last Theorem may be a good example. Garber therefore enriches the
underlying language L in such a way that the meta-language proposition
X is one of the atomic propositions of the extended language L’.

Garber also demands that the agent recognize some elementary rela- 
tions in which X stands to other elements of L’: 


p(E|T,X) = 1,    p(T,E,X) = p(T,X).    (3.3)


These constraints are an equivalent of modus ponens for a logic of degree 
of belief: conditional on T and X, the agent should be certain that E. In 
other words, if an agent takes T and X for granted, then she also believes 
E to maximal degree. Knowledge of such elementary inference schemes 
sounds eminently sensible when we are trying to model the boundedly ra- 
tional reasoning of a scientist. Garber then proves the following theorem: 
there is at least one probability function on L’ such that every non-trivial 
atomic sentence of the form X gets a value strictly between 0 and 1. Thus, 
one can coherently have a genuinely uncertain attitude about all
propositions in the logical universe, including tautologies. Finally, Garber shows
that there are infinitely many probability functions such that p(E) = 1 and
p(T|X,E) > p(T|E). A similar point is made, though with less formal detail and
rigor, by Niiniluoto (1983).

While Garber’s efforts are admirable, they only address the first step 
of solving the dynamic POE: he provides an existence proof for a solution 
to the POE, but he does not show that learning X confirms T for most 
plausible probability distributions over E, T and X. Also Niiniluoto (1983) 
only sketches a solution idea without filling in the details. This lacuna is 
closed by Richard Jeffrey (1983), who published his solution in the same 
volume where Garber’s paper appeared. 

Jeffrey considers the meta-proposition X as an object of subjective un- 
certainty, but he keeps formalities down to the standard level of Bayesian 
Confirmation Theory. Then he makes the following assumptions, using 
the notational convention X’ := T ⊢ ¬E:


(α) p(E) = 1.
(β) p(T), p(X), p(X’) ∈ (0,1).
(γ) p(X,X’) = 0.
(δ) p(T|X ∨ X’) > p(T).
(η) p(T,¬E,X’) = p(T,X’).


From these assumptions, Jeffrey derives p(T|X,E) > p(T|E), that is, the
solution to the dynamic POE.
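
To see that Jeffrey’s five assumptions can be satisfied jointly, and that the derived inequality then holds, consider the following minimal sketch. The joint distribution is a hypothetical example of our own, not taken from Jeffrey. Since E is assumed to be certain, it is left implicit; together with the Modus Ponens condition, this forces every state in which both T and X’ hold to have probability zero.

```python
# States are truth values of (T, X, Xp), where Xp stands for X'.
# E is certain, so it is left implicit. States not listed have probability 0:
# none with X and Xp together, and none with T and Xp together.
joint = {
    (True,  True,  False): 0.2,  # T holds and is known to entail E
    (True,  False, False): 0.1,  # T holds, no entailment in either direction
    (False, True,  False): 0.1,
    (False, False, True):  0.2,
    (False, False, False): 0.4,
}


def prob(event):
    """Probability of all states for which event(T, X, Xp) is true."""
    return sum(w for state, w in joint.items() if event(*state))


p_T = prob(lambda t, x, xp: t)
p_X = prob(lambda t, x, xp: x)
p_Xp = prob(lambda t, x, xp: xp)
p_X_and_Xp = prob(lambda t, x, xp: x and xp)
p_T_and_Xp = prob(lambda t, x, xp: t and xp)

# Probability of T given that T bears on E one way or the other.
p_T_given_X_or_Xp = (
    prob(lambda t, x, xp: t and (x or xp)) / prob(lambda t, x, xp: x or xp)
)

# Probability of T after learning X (E is already certain).
p_T_given_X = prob(lambda t, x, xp: t and x) / p_X

print(f"p(T) = {p_T:.3f}")
print(f"p(T | X or X') = {p_T_given_X_or_Xp:.3f}")
print(f"p(T | X) = {p_T_given_X:.3f}")
```

In this example, learning X raises the probability of T well above its prior, but only because the distribution was built so that T is more probable given that it bears on E at all.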

The strength of Jeffrey’s solution crucially depends on how well we can
motivate condition (δ). The other conditions are plausible: (α) is just the
standard presumption that at the time when confirmation takes place, E is
already known to the agent. (β) demands that we not be certain about
the truth of T or T ⊢ ±E beforehand, in line with the typical description
of the POE. (γ) requires that T not entail E and ¬E at the same time.
Finally, (η) is a Modus Ponens condition similar to (3.3): the joint degree
of belief in T, ¬E and X’ is equal to the joint degree of belief in T and X’,
demanding that the agent recognize that the latter two propositions entail
¬E.

Hence, (δ) really carries the burden of Jeffrey’s argument. This
condition has some odd technical consequences, as pointed out by Earman
(1992, 127). For instance, with plausible additional assumptions, we can
derive p(T|X) > 2p(T), which implies that the prior degree of belief p(T)
must have been smaller than .5. Jeffrey’s solution of the dynamic POE
does not apply to theories that were already quite probable, and this is an
awkward feature.

That said, the real problem with (δ) is not technical, but philosophical.
Jeffrey (1983, 148-149) supports (δ) by mentioning that Newton was, upon
formulating his theory of gravitation G, convinced that it would bear on
the phenomena he was interested in, namely the mechanism governing
the tides. Although Newton did not know whether G would entail the
phenomena associated with the tides or be inconsistent with them, he used
his knowledge that G would bear on the tides as a reason for accepting it
as a working hypothesis.

To our mind, this reconstruction conflates an evidential virtue of a the- 
ory with a methodological one. Theories of which we know that they make 
precise predictions on an interesting subject matter are worthy of further 
elaboration and pursuit, even if the content of their predictions is not yet 
known. This is basically a Popperian rationale for scientific inquiry: go for 
theories that have high empirical content, that make precise predictions, 
and develop them further. They are the ones that will finally help us to 
solve urgent scientific problems. Newton may have followed this method- 
ological rule when discovering that his theory of gravitation would have 
some implications for the tidal phenomena. Making such pragmatic
acceptances, however, does not entail a commitment to the thesis that the
plausibility of a theory increases with its empirical content. Actually,
Popper (2002, 268-269) thought the other way round: theories with high
empirical content rule out more states of the world and will have low
probability! This is just because they take, in virtue of making many
predictions, a higher risk of being falsified. Indeed, it is hard to understand why
increasing the empirical content of T provides an argument that T is more
likely to be true. Increasing the class of potential falsifiers of T should
not increase its plausibility. Jeffrey’s condition (δ) is therefore ill-grounded
and at the very least too controversial to act as a premise in a solution of
the POE.

Earman (1992, 128-129) considers two alternative derivations of
p(T|X,E) > p(T|E) where assumptions different from Jeffrey’s (δ) carry
the burden of the argument. One of them is the inequality


(φ) p(T|X) > p(T|¬X, ¬X’),


but it is questionable whether this suffices to circumvent the above
objections. What Earman demands here is very close to what is supposed to
be shown: that learning T ⊢ E is more favorable to T than learning that T
gives no definite prediction for the occurrence of E or ¬E. In the light of the
above arguments against (δ) and in the absence of independent arguments
in favor of (φ), this condition just seems to beg the question.

The second alternative derivation of p(T|X) > p(T) relies on the equality


(ψ) p(X ∨ X’) = 1.


However, as Earman admits himself, this condition is too strong: it
amounts to demanding that upon formulating T, the scientist was certain
that it either implied E or ¬E. In practice, such relationships are rather
discovered gradually. As Earman continues, discussing the case of GTR:


the historical evidence goes against this supposition: [...] Ein- 
stein’s published paper on the perihelion anomaly contained 
an incomplete explanation, since, as he himself noted, he had 
no proof that the solution of the field equations [...] was the 
unique solution for the relevant set of boundary conditions 
(Earman, 1992, 129) 


Taking stock, we conclude that Garber, Jeffrey, Niiniluoto and Earman 
make interesting proposals for solving the dynamic Problem of Old Ev- 
idence, but that their solutions are either incomplete or based on highly 
problematic assumptions. We will now show how their approach to the 
dynamic POE can be improved. 


Solving the Dynamic Problem of Old Evidence: 
Alternative Explanations 


A problem with the traditional GJN approaches is that they require
constraints on degrees of belief (e.g., Jeffrey’s (δ) or Earman’s (φ)) that are
either implausibly strong or too close to the desired confirmation-theoretic
conclusion p(T|X,E) > p(T|E) itself. To remedy this defect, we propose to
take into account whether alternatives to T adequately explain E. Let the
propositions X and Y be defined as follows:


• X := T adequately explains (or accounts for) E.
• Y := some alternative to T (= T’) adequately explains (or accounts for) E.


Now, consider the following four ordinal constraints on the degrees of 
belief of a rational Bayesian agent: 


p(T|X, ¬Y) > p(T|¬X, ¬Y)    (3.4)
p(T|X, ¬Y) > p(T|¬X, Y)     (3.5)
p(T|X, Y) > p(T|¬X, Y)      (3.6)
p(T|X, Y) ≥ p(T|¬X, ¬Y)     (3.7)


Let’s examine each of these four constraints in turn, assuming that E is
either a certainty or very probable. Suppose that ¬Y is the case and that
no alternative to T adequately explains E. Then, (3.4) asserts that T is more
probable given X (= T explains E) than given ¬X (= T does not explain E).
Judgments of evidential relevance translate into judgments of evidential
support, if there is no alternative to explain E.

(3.5) is an even less controversial variant of the same proposition. If 
T is the only explanation of E, T is more probable than if it does not ex- 
plain E and there is at least one good alternative. In other words, (3.4) 
and (3.5) say that T’s being the only adequate explanation of E confers a 
greater probability on T than any possibility which implies that T does not 
adequately explain E. These two constraints strike us as pretty uncontro- 
versial. 

Constraint (3.6) also seems very plausible. If there are alternatives to 
T that adequately explain E, T is more plausible if it explains E than if 
it doesn’t. This constraint mirrors the reasoning in (3.4) for the case that 
there are alternatives to T. 

The fourth and final inequality (3.7) says that T is at least as probable,
given the supposition that both T and some alternative scientific theory
adequately explain E (i.e., given X ∧ Y) as it is given the supposition that
no scientific theory adequately explains E (i.e., given ¬X ∧ ¬Y). It might
even be compelling to rank p(T|X, Y) strictly higher in one’s comparative
confidence ranking than p(T|¬X, ¬Y). After all, X ∧ Y implies that T
adequately explains old evidence E, whereas ¬X ∧ ¬Y implies that T does
not adequately explain E. On the other hand, one might also reasonably
maintain that both suppositions place T and its alternatives on a par with
respect to explaining E, and so they shouldn’t confer different
probabilities on T. Both of these positions are compatible with (3.7). The only thing
(3.7) rules out is the claim that T is more probable given E’s inexplicability
(¬X ∧ ¬Y) than it is given E’s multiple explicability by both T and some
alternative to T (X ∧ Y). As such, (3.7) also seems eminently reasonable.

Now, the desired conclusion (3.2) follows from (3.4)-(3.7). To be
precise, we can prove the following general result (see also Fitelson and
Hartmann, 2016):


Theorem 3.1 For propositions T, E, X, Y ∈ L with probability measure p ∈ 𝔓,
let 0 < p(T) < 1. Then conditions (3.4)-(3.7) jointly entail p(T|X) > p(T).


Of course, this result also applies to the case where we have already 
conditionalized on E and p(E) = 1: 


Corollary 3.1 For propositions T, E, X, Y ∈ L with probability measure p ∈ 𝔓,
let 0 < p(T) < 1. Then the analogues of conditions (3.4)-(3.7) for the probability
distribution p(·|E) jointly entail p(T|X,E) > p(T|E).
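
The theorem can be probed numerically. The following sketch is a sanity check, not a proof: it draws random, strictly positive joint distributions over the truth values of T, X and Y, keeps those that satisfy (3.4)-(3.7), and counts how often p(T|X) > p(T) fails. The sampling scheme, the seed and the helper names are our own assumptions.

```python
import itertools
import random

random.seed(0)
STATES = list(itertools.product([True, False], repeat=3))  # values of (T, X, Y)


def random_joint():
    """Draw a strictly positive joint distribution over the eight states."""
    weights = [random.random() + 1e-9 for _ in STATES]
    total = sum(weights)
    return {state: w / total for state, w in zip(STATES, weights)}


def cond_T(joint, x, y):
    """p(T | X = x, Y = y) under the given joint distribution."""
    return joint[(True, x, y)] / (joint[(True, x, y)] + joint[(False, x, y)])


def satisfies_constraints(joint):
    """Check the four ordinal constraints (3.4)-(3.7)."""
    a = cond_T(joint, True, False)   # p(T | X, not-Y)
    b = cond_T(joint, True, True)    # p(T | X, Y)
    c = cond_T(joint, False, True)   # p(T | not-X, Y)
    d = cond_T(joint, False, False)  # p(T | not-X, not-Y)
    return a > d and a > c and b > c and b >= d


checked = violations = 0
for _ in range(20000):
    joint = random_joint()
    if not satisfies_constraints(joint):
        continue
    checked += 1
    p_T = sum(w for (t, x, y), w in joint.items() if t)
    p_X = sum(w for (t, x, y), w in joint.items() if x)
    p_T_given_X = sum(w for (t, x, y), w in joint.items() if t and x) / p_X
    if not p_T_given_X > p_T:
        violations += 1

print(f"{checked} sampled distributions satisfy (3.4)-(3.7); violations: {violations}")
```

The reason no violation is expected is that p(T|X) is a weighted average of p(T|X,Y) and p(T|X,¬Y), while p(T|¬X) is a weighted average of p(T|¬X,Y) and p(T|¬X,¬Y). When all weights are positive, the four constraints place the first average strictly above the second.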

This approach has the following three distinct advantages over the tra- 
ditional GJN approaches. 


(i) Our approach does not require the assumption that p(E) = 1. It
may be true that our constraints (3.4)-(3.7) are most plausible given
the background assumption that E is known with certainty. But we
think (3.4)-(3.7) retain enough of their plausibility, given only the
weaker assumption that E is known with near certainty (i.e., p(E) ≈
1).


(ii) Our approach only rests on the ordinal constraints (3.4)-(3.7), not on
judgments of degrees of confirmation.


(iii) Our approach is not restricted to cases in which T (and/or alter- 
natives T’) explain E in a deductive-nomological way. That is, our 
approach covers all cases in which scientists come to learn that 
their theory adequately explains E, not only those cases in which 
scientists learn that their theory entails E (or explains E deductive- 
nomologically). 


A slight disadvantage of this approach is that conditions (3.4)-(3.7) are 
themselves phrased as confirmation judgments. That is, the solution of 
the Problem of Old Evidence (whether X confirms T) depends on which 
truth-functional combinations of X and Y confirm T. Such judgments may
be considered to be too close to what is supposed to be shown. We think
that our assumptions are plausible enough to withstand this criticism, but
we also present an alternative solution where the conditions are phrased in
terms of the likelihoods of T and X on E. But first, we move on to the static
POE. 


The Static Problem of Old Evidence: 
A Counterfactual Perspective 


The static Problem of Old Evidence is concerned with describing why old 
evidence E is evidentially relevant for theory T (Eells, 1985). The rela- 
tion of evidential relevance is supposed to be independent of the moment 
when the evidence was observed, when a relationship between theory and 
evidence was discovered, and so on. It corresponds to the question “Why 
is E at all—in the present as well as in the future—a reason for preferring T 
over its competitors?” By definition, this question is hard to answer in the 
Bayesian framework, which is in the first place a theory of confirmation as 
change in degree of belief. 

Christensen (1999) contends that the choice of a particular
confirmation measure may help us to resolve the static POE. Take the measure
s*(T,E) = p(T|E) − p(T|¬E). If T entails E, as in the GTR example, then
¬E also entails ¬T, which implies p(T|¬E) = 0 and s*(T,E) = p(T|E) > 0.
Hence E confirms T. According to s*, old evidence E can substantially
confirm theory T whereas the degree of confirmation is zero for
measures that compare the prior and posterior probability of T, such as
d(T,E) = p(T|E) − p(T) or r(T,E) = log(p(T|E)/p(T)). Choosing the
“right” confirmation measure therefore resolves the POE.
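
The difference between the two kinds of measure shows up as p(E) approaches one. The sketch below assumes that T entails E, so that p(T ∧ E) = p(T), and uses a hypothetical prior p(T) = 0.3; the function name is ours.

```python
def d_and_s_star(p_T, p_E):
    """Return (d, s*) for a theory T that entails E, with p(T) <= p(E) < 1."""
    p_T_given_E = p_T / p_E   # since p(T & E) = p(T)
    p_T_given_notE = 0.0      # not-E entails not-T
    d = p_T_given_E - p_T
    s_star = p_T_given_E - p_T_given_notE
    return d, s_star


for p_E in (0.5, 0.9, 0.999):
    d, s_star = d_and_s_star(p_T=0.3, p_E=p_E)
    print(f"p(E) = {p_E}: d = {d:.4f}, s* = {s_star:.4f}")
```

As p(E) approaches one, d shrinks toward zero while s* stays close to p(T). The limiting case p(E) = 1 is deliberately excluded, because p(T|¬E) is then undefined on the ratio analysis of conditional probability.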

This approach strikes us as problematic. First, it is questionable 
whether s* is a good explicatum for degree of confirmation. In Varia- 
tion 2, we have argued that s* fails to satisfy important adequacy criteria 
for degree of confirmation, such as Final Probability Incrementality. Also 
in general, the measure sensitivity of Christensen’s proposal is somewhat 
awkward: the challenge posed by the POE consists in showing that E raises 
the probability of T. Inequality (3.1) does not target the degree of confir- 
mation conferred by old evidence. Therefore, a solution that depends on 
the choice of a particular confirmation measure is less general than we 
desire.

Second, Christensen’s move has its merits for cases where p(E) is
close to, but not entirely equal to one. But in the classical POE where
p(E) = 1, p(T|¬E) may not have a clear-cut definition since p(T|¬E) =
p(T ∧ ¬E)/p(¬E) involves a division by zero. We could solve this
problem by evaluating p(T|¬E) not via the Ratio Analysis of conditional
probability, but as a counterfactual degree of belief: suppose that ¬E were the
case, how likely would T be? But then, Christensen’s solution proposal is
more than an appeal to a particular confirmation measure: it requires a
specific approach to conditional degree of belief which needs to be spelled
out in more detail.

Such an attempt is made by Colin Howson (1984, 1985, 1991). He gives 
up the Bayesian explication of confirmation as positive probabilistic
relevance relative to actual degrees of belief. Rather, he suggests evaluating
the confirmation relation with respect to a counterfactual degree of belief
function where E is not taken for granted:


[T]he Bayesian assesses the contemporary support E gives T by 
how much the agent would change his odds on T were he now to 
come to know E [...] In other words, the theory is explicitly a 
theory of dispositional properties of the agent’s belief-structure 
[...]. (Howson, 1984, 246, original emphasis) 


According to this account, conditional probabilities such as p(E|T) and
p(E|¬T) should not be understood as our actual degree of belief in E
supposing T or ¬T: this would be equal to one since E is already known and
the equation


1 = p(E) = p(E|T)p(T) + p(E|¬T)p(¬T)


would imply that also p(E|T) = p(E|¬T) = 1. Hence, p(·|T) must be a
different belief function: it describes those degrees of belief that we would
have in E if we knew nothing about E and T were the case. If T is a
statistical hypothesis, then we just add T hypothetically to our background
knowledge and calculate the probability of E conditional on this
assumption. For example, we could evaluate the probability of two heads and
three tails in five i.i.d. tosses of a fair coin (T: θ = 0.5) and a biased coin
(T’: θ = 0.6) and infer p(E|T) ≈ 0.31 and p(E|T’) ≈ 0.23. Similarly, in
the GTR example, we could conclude that p(E|T) = 1 because GTR
implies the Mercury perihelion shift, whereas p(E|T’) < 1 for Newtonian
mechanics and other theories that do not make definite predictions about
E. In general, the probability distributions p(·|T) and p(·|T’) describe our
conditional degrees of belief in pieces of evidence, supposing that T or T’.
In such a setting, we can meaningfully compare p(E|T) and p(E|T’), which
is not possible if we interpret these probabilities as our actual degrees of
belief in E, knowing that T (e.g., via Ratio Analysis).
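
The two likelihoods in the coin example follow from the binomial distribution. In the short sketch below, θ is read as the probability of heads on a single toss, which is an assumption about the intended parametrization (it is the reading that reproduces the stated values); the function name is ours.

```python
from math import comb


def binomial_likelihood(heads, tosses, theta):
    """Probability of exactly `heads` heads in `tosses` i.i.d. tosses,
    where theta is the probability of heads on a single toss."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)


p_E_given_T = binomial_likelihood(2, 5, 0.5)       # fair coin, T
p_E_given_Tprime = binomial_likelihood(2, 5, 0.6)  # biased coin, T'

print(round(p_E_given_T, 2), round(p_E_given_Tprime, 2))  # 0.31 0.23
```

These values depend only on the supposed hypothesis and the sampling model, not on whether the agent already knows the outcome E.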

How does this formalism translate into confirmation judgments? If p 
is an “impartial” prior probability distribution, that is, p(T) = p(T’), then 
we infer that p(T|E) > p(T’|E) if and only if p(E|T) > p(E|T’) (proof 
omitted). Final Probability Incrementality then implies that E confirms T 
more than it confirms T’. That is, our judgment on the conditional proba- 
bility of E given T, relative to minimal background knowledge, translates 
into a judgment of evidential support if the priors do not favor one of 
the hypotheses. In our interpretation, the static POE abstracts away from 
background knowledge at a particular point in history, and therefore, the 
counterfactual approach to conditional probability is an adequate tool to 
tackle it. We will now show how learning the proposition X := T ⊢ E
raises the probability of T relative to a counterfactual probability function 
as described above. That is, we solve the dynamic POE in a framework 
that is typical of the static POE. 


Solving the Hybrid Problem of Old Evidence: 
Learning Explanatory Relationships 


The aim of this section consists in showing that learning explanatory or 
deductive relationships between theory and evidence can raise our
degree of belief in the theory. In other words, learning X := (T adequately
explains E) raises the subjective probability of T relative to a probability
function where E is taken for granted. This looks like a formulation of 
the dynamic POE, but things are more subtle: we are not interested in 
whether X confirmed T for the scientist who discovered X (and relative 
to her degrees of belief), but in whether X should confirm T for all scien- 
tists in the community, irrespective of their actual degrees of belief. This 
second question is related to the static POE where scientific confirmation 
is independent of anybody’s subjective degrees of belief at a particular 
time. Hence, our question in this section, the confirmatory impact of
explanatory discoveries, needs to be phrased relative to a counterfactual
probability function, as in the static POE. That is why we call it the
hybrid POE.

What kind of probability function should be chosen? In our view, 
p(·|±T) should represent the degrees of belief of a scientist who has a
sound understanding of theoretical principles and their impact on
observational data, conditional on the assumption that T or ¬T is the case (see
also Earman, 1992, 134). Such degrees of belief are required for making
routine judgments in assessing evidence and reviewing journal articles:
How probable would the actual evidence E be if T were true? How
probable would E be if T were false? When T and ¬T are two definite statistical
hypotheses, like in Howson’s coin toss example, such judgments are im- 
mediately given by the corresponding sampling distribution. But even 
in more general contexts, such judgments may be straightforward, or a 
matter of consensus in the scientific community. 

We now formulate constraints on an agent’s conditional degrees of 
belief in the hybrid POE. The first condition characterizes the elementary 
inferential relations between E, T and X: 


[1] p(E|T,X) =1 


If T is true and T entails E, then E can be regarded as a certainty. In 
this scenario, X codifies a strict deductive relation between T and E. Later, 
we will relax this condition in order to cover more general explanatory
dependencies. 

To motivate the second constraint, note that learning the predictions
of a refuted hypothesis is irrelevant to our assessment of the plausibility of
the predicted events. For instance, the astrological theory on which
Nostradamus based his predictions is in all probability wrong. Upon learning
the content of his predictions (e.g., the third World War starting in 2048),
we should neither raise nor lower our credence in the events that his theory
predicted to happen. This motivates the equation p(E|¬T, X) = p(E|¬T),
or written differently, p(E|¬T, X) = p(E|¬T, ¬X). Again, these degrees of
belief ought to be interpreted in the neo-Ramseyian, counterfactual sense:
supposing that T has been disproved, would learning something about
the predictions of T affect our confidence in the occurrence of E? Plausibly
not, since T has ceased to be relevant for empirical forecasts. Hence
we demand that if ¬T is already known, then learning X or ¬X does not
change the probability of E. However, E should still be possible if T were
false. Hence:


[2] p(E|¬T, X) = p(E|¬T, ¬X) > 0


Finally, we have the following inequality: 


[3]
\[
p(E|T, \neg X) < \frac{1 - p(X|\neg T)}{1 - p(X|T)} \cdot \frac{p(X|T)}{p(X|\neg T)}
\]


This condition demands that the value of p(E|T, ¬X) be smaller than the
threshold on the right hand side. When X and T are positively relevant to
each other or probabilistically independent, [3] is trivially satisfied since
in that case, p(X|T) ≥ p(X|¬T), implying that the right hand side of [3]
is greater than or equal to one. But also if X and T are negatively relevant
to each other, [3] is plausibly satisfied. When the mutual negative impact
of X and T is not too strong, the two quotients in [3] are close to 1, and
the inequality will be satisfied as long as p(E|T, ¬X) is not too close to 1
itself. Given that T is assumed to be true, but that by ¬X, it does not fully
account for E, E should be far from certain for a rational Bayesian agent.
Here it is essential that the conditional probabilities are interpreted in the
counterfactual sense. Otherwise, we would always obtain p(E|·) = 1 for
old evidence E, regardless of which proposition stands to the right of the
vertical dash. In the (plausible) case of independence of T and X, this
would contradict [3] (p(E|T, ¬X) < 1).

Together with the unproblematic assumption that neither T nor ¬T is
a certainty beforehand (0 < p(T) < 1), these three conditions are jointly 
sufficient to prove that X confirms T relative to E. 


Theorem 3.2 Let T, X, and E be three propositions of ℒ with probability measure
p ∈ B and 0 < p(T) < 1. Let the following three conditions be satisfied:

[1] p(E|T, X) = 1;
[2] p(E|¬T, X) = p(E|¬T, ¬X) > 0;
[3]
\[
p(E|T, \neg X) < \frac{1 - p(X|\neg T)}{1 - p(X|T)} \cdot \frac{p(X|T)}{p(X|\neg T)}.
\]

Then, X confirms T relative to (old evidence) E; that is, p(T|E, X) > p(T|E).

In other words, if E is taken for granted, learning X raises the conditional
degree of belief in T if conditions [1]-[3], whose adequacy we have
justified above, are accepted. Put differently: if we knew little or nothing
about the observational history of a discipline, and we were informed
that E, then discovering X would raise our confidence in T. This seems to
be a perfectly reasonable sense in which X is evidence for T, relative to E.
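
The content of Theorem 3.2 can be checked numerically. The following minimal Python sketch is an illustration, not part of the original argument. It describes an agent by t = p(T), r = p(X|T), r_bar = p(X|¬T), e2 = p(E|¬T, X) = p(E|¬T, ¬X) and e3 = p(E|T, ¬X), sets p(E|T, X) = 1 as in [1], and compares p(T|E, X) with p(T|E). The function names and the particular numbers are assumptions chosen for the example.

```python
def threshold(r, r_bar):
    """Right-hand side of condition [3], with r = p(X|T) and r_bar = p(X|not-T)."""
    return (1 - r_bar) / (1 - r) * r / r_bar


def posteriors(t, r, r_bar, e2, e3):
    """Return (p(T|E,X), p(T|E)) under [1] p(E|T,X) = 1 and [2] p(E|not-T,X) = p(E|not-T,not-X) = e2."""
    # Bayes' Theorem conditional on X; the likelihood of E under T and X is 1 by [1].
    post_given_e_and_x = t * r / (t * r + (1 - t) * r_bar * e2)
    # Law of total probability: p(E|T) = r + e3 (1 - r); by [2], p(E|not-T) = e2.
    e_given_t = r + e3 * (1 - r)
    post_given_e = t * e_given_t / (t * e_given_t + (1 - t) * e2)
    return post_given_e_and_x, post_given_e


# Hypothetical agent for whom X and T are negatively relevant (r < r_bar).
t, r, r_bar, e2 = 0.3, 0.4, 0.5, 0.2

# Case 1: e3 lies below the threshold of [3], so X should confirm T relative to E.
with_x, without_x = posteriors(t, r, r_bar, e2, e3=0.5)
print(threshold(r, r_bar), with_x > without_x)

# Case 2: e3 exceeds the threshold; [3] fails and, for these numbers, so does confirmation.
with_x, without_x = posteriors(t, r, r_bar, e2, e3=0.9)
print(with_x > without_x)
```

With these numbers the threshold in [3] is 2/3, so e3 = 0.5 satisfies the condition whereas e3 = 0.9 violates it.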

This theorem solves the hybrid POE based on a combination of strategies.
The main idea stems from the GJN models: the confirming event is
the discovery that T accounts for, or explains, E. The relevant constraints,
however, are spelled out in terms of conditional degrees of belief which are
interpreted in a counterfactual sense, as in the static POE. Then, with the help
of Bayes’ Theorem, the constraints transfer to bounds on the conditional
probability of T given E and X.

In many cases of scientific reasoning, the condition p(E|T, X) = 1 may
be too strong. It may apply well to the Mercury perihelion shift, which
is deductively implied by GTR, but it may fail to cover cases where T
accounts for E in a less rigorous manner (Earman, 1992, 121; see also
Fitelson, 2015). If we allow for a weaker interpretation of X, e.g., as providing
some explanatory mechanism, then we are faced with the possibility that
even if we are certain that T is true, and that T explains E, the conditional
degree of belief in E may not be a certainty. p(E|T, X) < 1 could even make
sense if the relationships between T and E are deductive: the proof of X
could be so complex that the involved scientists have some doubts about its
soundness and refrain from assigning it maximal degree of belief. Again,
Fermat’s Last Theorem may be a plausible intuition pump.

To cover this case, we prove another theorem for
p(E|T, X) = 1 - ε with some small ε > 0.


Theorem 3.3 Let T, X, and E be three propositions of ℒ with probability measure
p ∈ B and 0 < p(T) < 1. Let the following three conditions be satisfied:

[1’] p(E|T, X) = 1 - ε for some 0 < ε < 1;
[2’] p(E|¬T, X) = p(E|¬T, ¬X) > 0;
[3’]
\[
p(E|T, \neg X) < (1 - \varepsilon) \cdot \frac{1 - p(X|\neg T)}{1 - p(X|T)} \cdot \frac{p(X|T)}{p(X|\neg T)}.
\]

Then, X confirms T relative to (old evidence) E; that is, p(T|E, X) > p(T|E).


The motivations and justifications for the above assumptions are the
same as in Theorem 3.2. [1’] just accounts for lack of full certainty about
the old evidence, and [2’] is identical to [2]. Moreover, condition [3] of
Theorem 3.2 can, with the same line of reasoning, be extended to condition
[3’] in Theorem 3.3. [3’] sharpens [3] by a factor of 1 - ε, but leaves the
qualitative argument for [3] intact. As long as p(E|T, ¬X) and p(E|T, X)
decrease by roughly the same margin, the result of Theorem 3.2 transfers
to Theorem 3.3.
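
As a sanity check on Theorem 3.3, one can sample random credences that satisfy [1’]-[3’] and verify the conclusion in each case. The Python sketch below is an illustration under assumptions of our own choosing (uniform sampling ranges, a fixed seed, hypothetical function names); it is no substitute for the proof.

```python
import random


def confirms(t, r, r_bar, e2, e3, eps):
    """True if p(T|E,X) > p(T|E), given p(E|T,X) = 1 - eps and p(E|not-T,X) = p(E|not-T,not-X) = e2."""
    e1 = 1 - eps
    post_given_e_and_x = t * r * e1 / (t * r * e1 + (1 - t) * r_bar * e2)
    e_given_t = r * e1 + (1 - r) * e3  # law of total probability; p(E|not-T) = e2 by [2']
    post_given_e = t * e_given_t / (t * e_given_t + (1 - t) * e2)
    return post_given_e_and_x > post_given_e


def count_violations(trials=10_000, seed=0):
    """Sample credences satisfying [1']-[3'] and count cases where confirmation fails."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        t, r, r_bar, e2 = (rng.uniform(0.05, 0.95) for _ in range(4))
        eps = rng.uniform(0.01, 0.2)
        bound = (1 - eps) * (1 - r_bar) / (1 - r) * r / r_bar  # right-hand side of [3']
        # Keep e3 a probability and clearly below the bound, so that [3'] holds.
        e3 = rng.uniform(0.0, 0.9) * min(bound, 1.0)
        if not confirms(t, r, r_bar, e2, e3, eps):
            violations += 1
    return violations


print(count_violations())
```

If the theorem is correct, no sampled agent should count as a violation. Letting e3 exceed the bound instead produces cases in which X lowers the probability of T.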

Thus, we can extend the novel solution of POE to the case of residual
uncertainty about the old evidence E, a case that is highly relevant
for case studies in the history of science. If we compare this solution of
the POE to Jeffrey’s and Earman’s proposals, we note that our assumptions
[1], [2] and [3] are silent on whether Jeffrey’s (δ), or Earman’s (φ)
and (ψ), for that matter, is true or false. For a proof with the help of
Branden Fitelson’s PrSAT package for Mathematica (Fitelson, 2008), see
Sprenger (2015). Hence we can discard Jeffrey’s dubious assumption (δ)
that increasing empirical content makes a theory more plausible, without
jeopardizing our own results.

We have thus provided a solution that successfully tackles
a hybrid version of the POE. Notably, our solution makes less demanding
assumptions than those of Jeffrey and Earman. Conceptually, however, this
solution is anchored in the static POE and in the use of counterfactual (rather
than actual) probability functions. We now discuss the repercussions of
our results on the general debate about POE, and the role of Bayesian
Confirmation Theory in scientific reasoning.


Discussion 


This variation has analyzed Bayesian attempts to solve the Problem of
Old Evidence (POE), and it has proposed two new solutions. We
started with a distinction between the static and the dynamic aspect of
the problem. Simplifying a bit, we can say that the static POE deals
with the question of providing an account of conditional probability
where p(E|T) > p(E|¬T) for old evidence E, demonstrating the evidential
relevance of E for T. The dynamic problem, on the other hand,
deals with the challenge to provide reasonable constraints on p such that
p(T|X, E) > p(T|E), with X denoting the proposition that T implies or
explains E (T ⊢ E).

We first presented our criticism of existing solutions to the dynamic
POE in the footsteps of Garber, Jeffrey and Niiniluoto (GJN). Our first
model of the dynamic POE was based on judgments about when T is confirmed
by the presence and absence of alternative hypotheses that could account
for old evidence E. The second model was based on constraints on the
conditional degrees of belief p(E|±T, ±X).


To prevent these degrees of belief from being equal to one because E is old
evidence, we interpreted p as a properly counterfactual degree of belief
function and not as describing our actual degrees of belief. Indeed, in
scientific practice, we typically interpret p(E|T) and p(E|¬T) as principled
statements about the predictive import of ±T on E, without referring to
our complete observational record. Such judgments are part and parcel
of scientific reasoning, e.g., in statistical inference, where theories T, T’,
etc. impose definite probability distributions on the observations, and our
degrees of belief p(E|T), p(E|T’), etc. follow suit. However, the standard
account of conditional degree of belief (Ratio Analysis) does not have this
property. We therefore suggested that the counterfactual account of
conditional degree of belief presented in the introduction can contribute to
solving the (static and dynamic) POE.


It is also worth mentioning that our treatment of the POE allows for a
distinction between theories that have been constructed to explain the old
evidence E and those that explain E surprisingly (like Einstein’s GTR). In
the first case, we would not speak about proper confirmation. Indeed, if
we accommodate the parameter values of a general theory T such that it
explains the old evidence E, whatever this evidence turns out to be, then
X is actually a certainty: p(X) = 1. This is because T has been designed
to explain E. As a consequence, p(T|E, X) = p(T|E) and X fails to confirm
T. In the case of a surprising discovery of an explanatory relation
between T and E, by contrast, p(X) < 1. The degree of confirmation that X
confers on T is the greater, the more surprising X is, in line with our intuitive
judgment that surprising explanations have special cognitive value. In
general, there seem to be strong parallels between the POE and the prediction
vs. accommodation debate in philosophy of science (e.g., Hitchcock and
Sober, 2004). Future research could investigate this relationship in more
detail, and also come up with case studies about scientific reasoning with
old evidence, enabling a better evaluation of our solution proposals. Other
research projects that spring to mind are an integration of the POE with
explanatory reasoning in science (→ Variation 7) and providing a solution
of the POE in terms of learning conditionals. After all, the dynamic POE
can be described as learning the (strict) conditional that if T, then also E.
We can use our account of learning conditionals from Variation 1 in order
to describe conditions when learning T ⊢ E raises the probability of T.
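
The step from p(X) = 1 to p(T|E, X) = p(T|E) in the accommodation case can be spelled out with the law of total probability. The following lines are a sketch that assumes p(E) > 0, so that p(X) = 1 implies p(X|E) = 1 and p(¬X|E) = 0:

```latex
\begin{align*}
p(T|E) &= p(T \wedge X|E) + p(T \wedge \neg X|E) \\
       &= p(T|E,X)\,p(X|E) + p(T \wedge \neg X|E) \\
       &= p(T|E,X) \cdot 1 + 0 \qquad \text{since } 0 \le p(T \wedge \neg X|E) \le p(\neg X|E) = 0 \\
       &= p(T|E,X).
\end{align*}
```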

Finally, a general, but popular critique of Bayesian approaches to the 
POE is inspired by the view that the POE poses a principled and insoluble 
problem for Bayesian Confirmation Theory. For instance, Glymour writes 
at the end of his discussion of the POE: 


[...] our judgment of the relevance of evidence to theory de- 
pends on the perception of a structural connection between the 
two, and [...] degree of belief is, at best, epiphenomenal. In 
the determination of the bearing of evidence on theory there 
seem to be mechanisms and strategems that have no apparent 
connections with degrees of belief. (Glymour, 1980, 92-93) 


What Glymour argues here is not so much that a specific formal as- 
pect of the Bayesian apparatus (e.g., logical omniscience) prevents it from 
solving the POE, but that these shortcomings are a symptom of a more 
general inadequacy of Bayesian Confirmation Theory: the inability to cap- 
ture structural relations between evidence and theory. This criticism should 
not be misunderstood as claiming that confirmation has to be conceived 
of as an objective relation that is independent of contextual knowledge or 
contingent background assumptions. Rather, it suggests that solutions to 
the dynamic POE mistake an increase in degree of belief for a structural re- 
lation between T and E. But what makes E relevant for T is not the increase 
in degree of belief p(T|E) > p(T), but the entailment relation between T 
and E. Hence Glymour’s verdict that Bayesian Confirmation Theory gives 
“epiphenomenal” results. 

To our mind, this criticism commits two oversights. First, solutions to
the static POE answer Glymour’s challenge by showing how the concept
of evidential support can be interpreted in a way that is not bound to
belief updating at a particular point in time. We have tried to provide
such an account in Section 3.3. Second, the criticism is too fundamental
to be a source of genuine concern: it is not specific to the (dynamic) POE
or one of its solutions, but it attacks the entire Bayesian explication of
confirmation as increase in firmness. However, as we have seen in the
previous variation, Bayesian Confirmation Theory can point to a lot of
success stories: resolving the tacking by conjunction paradoxes, the raven
paradox, the new riddle of induction, and so on. What we have shown
here is that confirmation by old evidence might be added to this list. The
next variation moves on to an argument that we already touched upon in
Section 3.2: failure to find adequate alternatives confirms a theory.


Proofs of the Theorems


Proof of Theorem 3.1: Let

\[
\begin{aligned}
a &:= p(T|X \wedge \neg Y), & b &:= p(T|X \wedge Y), & c &:= p(T|\neg X \wedge \neg Y), \\
d &:= p(T|\neg X \wedge Y), & x &:= p(\neg Y|X), & y &:= p(\neg Y|\neg X).
\end{aligned}
\]

Given these assignments, (3.4)-(3.7) translate as follows:

\[
a > c, \qquad a > d, \qquad b > d, \qquad b > c.
\]

Suppose that a ∈ (0,1], d ∈ [0,1), and b, c, x, y ∈ (0,1).¹ Then, (3.4)-(3.7)
jointly entail

\[
a x + b (1 - x) > c y + d (1 - y).
\]

And, by the law of total probability, we have:

\[
p(T|X) = a x + b (1 - x), \qquad p(T|\neg X) = c y + d (1 - y).
\]

Thus, (3.4)-(3.7) jointly entail p(T|X) > p(T|¬X), which entails
p(T|X) > p(T).
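
The last step, from p(T|X) > p(T|¬X) to p(T|X) > p(T), is another application of the law of total probability. A short sketch, assuming 0 < p(X) < 1 (which is needed for both conditional probabilities to be defined and for the inequality to be strict):

```latex
\begin{align*}
p(T) &= p(T|X)\,p(X) + p(T|\neg X)\,\bigl(1 - p(X)\bigr) \\
     &< p(T|X)\,p(X) + p(T|X)\,\bigl(1 - p(X)\bigr) = p(T|X).
\end{align*}
```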


¹ The only two conditional credences that may reasonably take extremal values here are
a and d. If T is the only theory that adequately explains E, then it may be reasonable to
assign T maximal credence. And, if some alternative to T (e.g., T’) is the only theory that
adequately explains E, then it may be reasonable to assign minimal credence to T. This is
why we allow a ∈ (0,1] and d ∈ [0,1). The other conditional credences involved in our
theorem (i.e., b, c, x, y) should, in general, take non-extreme values.


Proof of Theorem 3.2: First, we define

\[
\begin{aligned}
e_1 &= p(E|T, X), & e_2 &= p(E|\neg T, X), \\
e_3 &= p(E|T, \neg X), & e_4 &= p(E|\neg T, \neg X), \\
t &= p(T), & r &= p(X|T), \\
t' &= p(T|X), & \bar{r} &= p(X|\neg T).
\end{aligned}
\]

By making use of [1] (e_1 = 1), [2] (e_2 = e_4 > 0), and the Extension Theorem
p(X|Z) = p(X|Y, Z) p(Y|Z) + p(X|¬Y, Z) p(¬Y|Z), we can quickly verify the
identities

\[
\begin{aligned}
p(E|T) &= p(E|T, X)\,p(X|T) + p(E|T, \neg X)\,p(\neg X|T) = r + e_3 (1 - r), \\
p(E|\neg T) &= p(E|\neg T, X)\,p(X|\neg T) + p(E|\neg T, \neg X)\,p(\neg X|\neg T) = e_2 \bar{r} + e_4 (1 - \bar{r}) = e_2
\end{aligned}
\]

that will be useful later. Second, we note that by Bayes’ Theorem and
assumption [1],


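\[
p(T|E, X) = p(T|X)\,\frac{p(E|T, X)}{p(E|X)}
= \left(1 + \frac{p(\neg T|X)}{p(T|X)} \cdot \frac{p(E|\neg T, X)}{p(E|T, X)}\right)^{-1}
= \left(1 + \frac{1 - t'}{t'} \cdot e_2\right)^{-1} \tag{3.8}
\]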


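Third, we observe that by [1], [2] and the above identities for p(E|T) and
p(E|¬T),

\[
p(T|E) = p(T)\,\frac{p(E|T)}{p(E)}
= \left(1 + \frac{p(\neg T)}{p(T)} \cdot \frac{p(E|\neg T)}{p(E|T)}\right)^{-1}
= \left(1 + \frac{1 - t}{t} \cdot \frac{e_2}{r + e_3 (1 - r)}\right)^{-1} \tag{3.9}
\]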


We also note by [3] that e_3 < (1 - r̄)/(1 - r) · r/r̄. Note that it is implicit in condition [3]
that 1 > p(X|T), p(X|¬T) > 0 since otherwise, the expression would either
be undefined (division by zero), or p(E|T, ¬X) would have to be smaller
than zero, which is impossible.

This allows us to derive 


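\[
r + e_3 (1 - r) < r + \frac{1 - \bar{r}}{1 - r} \cdot \frac{r}{\bar{r}} \cdot (1 - r)
= r \left(1 + \frac{1 - \bar{r}}{\bar{r}}\right)
= \frac{r}{\bar{r}}
\]

and consequently also

\[
\frac{1}{r + e_3 (1 - r)} > \frac{\bar{r}}{r}. \tag{3.10}
\]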




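Moreover, note the equality

\[
\frac{1 - t'}{t'} = \frac{p(\neg T|X)}{p(T|X)}
= \frac{p(X|\neg T)}{p(X|T)} \cdot \frac{p(\neg T)}{p(T)}
= \frac{\bar{r}}{r} \cdot \frac{1 - t}{t}. \tag{3.11}
\]

All this implies that

\[
\begin{aligned}
\frac{p(T|E, X)}{p(T|E)}
&= \left(1 + \frac{1 - t'}{t'} \cdot e_2\right)^{-1} \cdot \left(1 + \frac{1 - t}{t} \cdot \frac{e_2}{r + e_3 (1 - r)}\right) \\
&> \left(1 + \frac{1 - t'}{t'} \cdot e_2\right)^{-1} \cdot \left(1 + \frac{1 - t'}{t'} \cdot e_2\right) = 1.
\end{aligned}
\]

The first line uses (3.8) and (3.9); the second line uses (3.10) and (3.11),
completing the proof. The second line has also used that e_2 > 0, as
ensured by [2], and that t, r, r̄ ∈ (0,1) also implies t' ∈ (0,1).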


Proof of Theorem 3.3: By means of performing the same steps as in the 
proof of Theorem 3.2, we can easily verify the equalities 


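\[
\begin{aligned}
p(T|E, X) &= \left(1 + \frac{p(\neg T|X)}{p(T|X)} \cdot \frac{p(E|\neg T, X)}{p(E|T, X)}\right)^{-1}
= \left(1 + \frac{1 - t'}{t'} \cdot \frac{e_2}{1 - \varepsilon}\right)^{-1}, \\
p(T|E) &= \left(1 + \frac{p(\neg T)}{p(T)} \cdot \frac{p(E|\neg T)}{p(E|T)}\right)^{-1}
= \left(1 + \frac{1 - t}{t} \cdot \frac{e_2 \bar{r} + e_4 (1 - \bar{r})}{(1 - \varepsilon) r + e_3 (1 - r)}\right)^{-1} \\
&= \left(1 + \frac{1 - t}{t} \cdot \frac{e_2}{(1 - \varepsilon) r + e_3 (1 - r)}\right)^{-1},
\end{aligned}
\]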

where we have made use of [1’] and [2’]=[2]. We also note that [3’] implies 


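\[
(1 - \varepsilon) r + e_3 (1 - r)
< (1 - \varepsilon) r + (1 - \varepsilon) \cdot \frac{1 - \bar{r}}{1 - r} \cdot \frac{r}{\bar{r}} \cdot (1 - r)
= (1 - \varepsilon)\, r \left(1 + \frac{1 - \bar{r}}{\bar{r}}\right)
= (1 - \varepsilon)\, \frac{r}{\bar{r}}
\]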


and therefore also 


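\[
\frac{1}{(1 - \varepsilon) r + e_3 (1 - r)} > \frac{\bar{r}}{(1 - \varepsilon) r}. \tag{3.12}
\]

This brings us to the final calculation:

\[
\begin{aligned}
\frac{p(T|E, X)}{p(T|E)}
&= \left(1 + \frac{1 - t'}{t'} \cdot \frac{e_2}{1 - \varepsilon}\right)^{-1} \cdot \left(1 + \frac{1 - t}{t} \cdot \frac{e_2}{(1 - \varepsilon) r + e_3 (1 - r)}\right) \\
&> \left(1 + \frac{1 - t'}{t'} \cdot \frac{e_2}{1 - \varepsilon}\right)^{-1} \cdot \left(1 + \frac{1 - t'}{t'} \cdot \frac{e_2}{1 - \varepsilon}\right) = 1,
\end{aligned}
\]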


where we have simultaneously applied Equations (3.11) and (3.12) in the
second line. This completes the proof.


Variation 4: The No Alternatives Argument


In the previous chapters, we have described how empirical observations 
confirm or disconfirm a scientific hypothesis by means of probabilistic 
relevance. For example, the observation of a black raven raises the prob- 
ability of the hypothesis that all ravens are black, and certain clicks in a 
particle detector make us more confident in the existence of the top quark. 
However, there are situations where empirical evidence is unattainable 
over long periods of time. Such situations arise with particular force in 
contemporary high energy physics, where the characteristic empirical sig- 
natures of theories like Grand Unified Theories or string theory must be 
expected to lie many orders of magnitude beyond the reach of present 
day experimental technology. They are also entirely common in scientific 
fields such as palaeontology or anthropology, where scientists must rely 
on the scarce and haphazard empirical evidence they happen to find in 
the ground. Interestingly, scientists are at times quite confident regard- 
ing the adequacy of their theories even when empirical evidence is largely 
or entirely absent. In such cases, they base their trust on what we call 
non-empirical evidence: evidence for T that neither falls into the (broadly 
construed) intended domain of T nor is logically or probabilistically re- 
lated to T. Such evidence can, for example, consist in observations about 
the research process leading up to the construction of T, or the standing 
of T in the research community. The wording “non-empirical” is not sup- 
posed to express a rationalist or idealist concept of theory confirmation. 


From an empiricist point of view, arguments relying on non-empirical 
evidence may be regarded as mere speculation: they neither contribute 
to actual theory confirmation nor do they have objective scientific weight. 
We challenge this claim by exploring the following case: scientists develop 
considerable trust in a theory T because, despite considerable efforts, no
alternatives to T have been found that meet crucial theoretical and empirical
constraints. We call this argument the No Alternatives Argument
(NAA) and set up a Bayesian model to prove its validity. The name of the 
argument stems from its crucial premise (scientists have not yet found a 
suitable alternative to T); it does not draw the conclusion that there are no 
alternatives to T. If valid, the NAA would demonstrate the possibility of 
non-empirical theory confirmation. 

A variant of NAA is actually often used in politics and is well-known 
under the acronym TINA: “There is no alternative”. The former British 
prime minister Margaret Thatcher was famous for promoting her politics 
of privatization and economic liberalization by means of this slogan. The 
present German chancellor, Angela Merkel, has also used TINA to defend 
her political course in the euro crisis (and more recently, in the refugee 
crisis) as alternativlos, that is, without alternatives: “If the euro falls, Eu- 
rope will fall, too”. An investigation of the NAA will therefore not only 
shed light on patterns of scientific reasoning and the possibility of non- 
empirical theory confirmation, but also elucidate the validity of argument 
patterns that are frequently used in the political debate. But above all, it 
complements and completes the investigation of confirmatory arguments 
in science that we have begun in the first three variations. 

The setup of this variation is very simple: Section 4.1 introduces a formal
model of the NAA and makes plausible assumptions about the epistemic
ramifications of non-empirical evidence. Section 4.2 then presents
our main results, discusses their significance and explores an application
to Inference to the Best Explanation. For more details, see Dawid et al.
(2015). Section 4.3 reports the proofs.


Modeling the No Alternatives Argument 


We would like to investigate whether observing a lack of alternatives to 
T confirms the empirical adequacy of T. Two disclaimers to begin with: 
if two theories make the same predictions, then we consider them to be 
identical. Moreover, we would like to sidestep debates about scientific 
realism (→ Variation 5) and focus on the empirical adequacy of T rather
than on its truth. 

On a Bayesian account of confirmation, the subjective degree of belief
in T has to be raised by the lack of alternatives. But how is this possible?
After all, a lack of alternatives is neither deductively nor probabilistically
implied by T. It does not even fall into the intended domain of T. Does
this observation then qualify as (non-empirical) evidence in an argument
from ignorance, such as: if T were not empirically adequate, then we would
have disproved it before (Walton, 1995; Hahn and Oaksford, 2007; Sober,
2009)? More generally, how does the Bayesian address the problem that
there may always be unconceived alternatives to T which may explain the
available evidence as well as T, or even better (Stanford, 2006)?


The most plausible way to solve this problem is to deploy a two-step
process. First, we find a statement that predicts the failure to find
alternatives to T. Then, we show that this statement provides evidential support
for the empirical adequacy of T. In the case of NAA, our non-empirical
evidence ¬F_A consists in the fact that scientists have not found any
alternatives to a specific solution of a research problem, despite looking for
them with considerable energy and for a long time. Obviously, the number
of satisfactory alternatives to T matters here. A small number of
available alternatives renders ¬F_A more likely than a large number of
alternatives: in the latter case, one might expect that scientists would have
already found one of them.


The number of scientific theories which can account for a certain set of
data is in turn relevant for the degree of belief in the empirical adequacy of
T. The more alternatives exist, the less likely it is that a particular theory is
empirically adequate. In other words, the observation that scientists have
not yet found an alternative to T indicates that there are not too many
alternatives to T, and thus figures as an argument for T. A lower number
of possible scientific theories that account for a certain set of empirical
data increases the degree of belief that our actual theory is adequate (see
also the argument from no choice in Dawid, 2006, 2009).


Based on this reasoning, we introduce a random variable A measuring
the number of alternatives to T and taking values in the natural numbers.
A_k := {A = k} expresses the proposition that there are k adequate and
distinct alternatives which satisfy a set of theoretical constraints C, are
consistent with the existing data D, and give distinguishable predictions
for the outcome of some set ℰ of future experiments. We will later show
that, via its effect on the A_k, the non-empirical evidence ¬F_A confirms the
empirical adequacy of T under plausible conditions.


Inferences about the number of alternatives to a theory T naturally depend
on what counts as a genuine alternative. This is, in turn, sensitive to
the specific scientific context. Therefore we leave the individuation prob- 
lem to the scientific community which typically has the best grip on what 
should count as a distinct theory. Moreover, for the No Alternatives Ar- 
gument, we only require the premise that the number of alternatives to T 
possibly be finite. In other words, we are not certain a priori that there are 
infinitely many alternatives. 

In order to motivate this assumption, we assume that different theories 
provide different solutions to a given research problem. That is, theories 
which only differ in a detail, such as the precise value of a parameter or 
the existence of a physically meaningless dummy variable, do not count as 
different theories. For example, the simple Higgs model in particle physics 
is treated as one theory, although the Higgs particle could have different 
mass values. Generally, if it were enough to slightly modify the value of 
a certain parameter in order to arrive at a new theory, then coming up 
with new theories would be an easy and not very creative task. Inventing 
a novel mechanism or telling a new story of why a certain phenomenon
occurred is much harder.


Scientists often formulate no alternatives arguments at the level of 
general conceptual principles while allowing for a large spectrum of spe- 
cific realizations of those principles. For example, since the 1980s particle 
physicists strongly supported a no alternatives argument with respect to 
the Higgs mechanism. That is, they believed that no alternatives to a 
gauge theory that was spontaneously broken by a Higgs sector of scalar 
fields could account for the available empirical data. Physicists strongly 
believed, based on an NAA, that the Higgs sector would be observed at the
LHC experiment but did not have particular trust in any of the specific 
models of the Higgs sector. In this case, the NAA clearly was placed at 
the level of physical principles rather than specific models. 


Following this line of reasoning, we reconstruct NAA based on the 
notion that there exists a specific but unknown number k of possible sci- 
entific theories. As stated above, these theories have to satisfy constraints 
C, explain data D and predict the outcomes of the experiments ℰ. We will
then show that failure to find an alternative to T raises the probability of 
T being empirically adequate and thus confirms T. 

To do so, we introduce the binary propositional variables H and F_A.
As before, H takes the values

H     Theory T is empirically adequate.

¬H    Theory T is not empirically adequate.

and F_A takes the values

F_A   The scientific community has found an alternative to T that fulfills
      C, explains D and predicts the outcomes of ℰ.

¬F_A  The scientific community has not yet found an alternative to T that
      fulfills C, explains D and predicts the outcomes of ℰ.


Figure 4.1: The Bayesian Network representation of the two-propositions 
scenario. 


We would now like to explore under which conditions ¬F_A confirms
H, that is, when

p(H|¬F_A) > p(H). (4.1)

Figure 4.1 suggests a direct influence of H on F_A. But since a direct
influence is blocked by the non-empirical nature of F_A, we introduce a third
variable A which mediates the connection between H and F_A. A has values
in the natural numbers, and A_k corresponds to the proposition that there
are exactly k hypotheses that fulfil C, explain D and predict the outcomes
of ℰ.

We should also note that the value of F_A (that is, whether scientists find or do
not find an alternative to T) does not only depend on the number of available
alternatives, but also on the difficulty of the problem, the cleverness of the
scientists, or the available computational, experimental, and mathematical
resources. Call the variable that captures these complementary factors D,
and let it take values in the natural numbers, with D_j := {D = j} and
d_j := p(D_j). The higher the values of D, the more difficult the problem.
For the purpose of our argument, it is not necessary to assign a precise
operational meaning to the various levels of D (see condition A3 later on).
It is clear that D has no direct influence on A and H (or vice versa), but
that it matters for F_A and that this influence has to be represented in our
Bayesian Network.

We now list five plausible assumptions that we need for showing the
validity of the No Alternatives Argument.

A1 The variable H is conditionally independent of F_A given A:

H ⊥⊥ F_A | A (4.2)

Hence, learning that the scientific community has or has not found
an alternative to T does not alter our belief in the empirical adequacy
of T if we already know the value of A (e.g., that there are exactly k
viable alternatives).

A2 The variable D is (unconditionally) independent of A:

D ⊥⊥ A (4.3)

Recall that D represents the aggregate of those context-sensitive
factors that affect whether scientists find an alternative to T, but that are
not related to the number of suitable alternatives. In other words, D
and A are orthogonal to each other by construction.

These are our most important assumptions, and we consider them to
be eminently sensible. Figure 4.2 shows the corresponding Bayesian
Network. To complete it, we have to specify the prior distribution over D
and A and the conditional distributions over F_A and H, given the values of
their parents. This is done in the following three assumptions.

Figure 4.2: The Bayesian Network representation of the four-propositions
scenario.


A3 The conditional probabilities

f_kj := p(¬F_A|A_k, D_j) (4.4)

are non-increasing in k for all j ∈ ℕ and non-decreasing in j for all
k ∈ ℕ.

The (weak) monotonicity in the first argument reflects the intuition
that for fixed difficulty of a problem, a higher number of available
alternatives increases the chance of finding one of them. In other
words, the more reasonable alternatives to T are around, the less
likely it is that scientists fail to find one. The (weak) monotonicity in
the second argument reflects the intuition that increasing difficulty of
a problem does not increase the likelihood of finding an alternative
to T, provided that the number of alternatives to T is fixed.


A4 The conditional probabilities

t_k := p(H|A_k) (4.5)

are non-increasing in k.

This assumption reflects the intuition that an increase in the number
of alternative theories does not make it more likely that scientists
have already identified an empirically adequate theory.


A5 There is at least one pair (i, k) with i < k for which (i) $a_i a_k > 0$, where
$a_k := p(A_k)$, (ii) $f_{ij} > f_{kj}$ for some $j \in \mathbb{N}$, and (iii) $t_i > t_k$.
This assumption demands that at least two of the $a_i$ be greater than
zero, and it strengthens A3 and A4 by demanding that the $f_{ij}$ and $t_i$
not be constant in i.


Results and Discussion 


The previous section has set up a formal model of the NAA in a Bayesian
Network (see Figure 4.2) and made five assumptions on how the variables
in that network hang together (see A1-A5). With these assumptions in
hand, we can now show the following main result:


Theorem 4.1 (Validity of the NAA) If A takes values in the natural numbers
$\mathbb{N}$ and assumptions A1 to A5 hold, then $\neg F_A$ confirms H, that is,
$p(H \mid \neg F_A) > p(H)$.


We have therefore shown that $\neg F_A$ confirms the empirical adequacy of
T under rather weak and plausible assumptions.
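
The content of Theorem 4.1 can be checked numerically on a toy example. The following Python sketch assumes a finite truncation of the model (three values of A, two values of D); the function name `naa_confirmation` and all parameter values are hypothetical and were chosen only so that A3-A5 hold. It illustrates the theorem and does not replace the proof.

```python
from itertools import product


def naa_confirmation(a, d, f, t):
    """Return (p(H), p(H | not F_A)) for a finite version of the model.

    a[k]    = p(A_k)                  prior over the number of alternatives
    d[j]    = p(D_j)                  prior over difficulty levels
    f[k][j] = p(not F_A | A_k, D_j)   chance of not finding an alternative
    t[k]    = p(H | A_k)              chance that T is empirically adequate
    The factorization uses A1 (H independent of F_A given A) and
    A2 (D independent of A).
    """
    pairs = list(product(range(len(a)), range(len(d))))
    p_h = sum(a[k] * t[k] for k in range(len(a)))
    p_not_f = sum(a[k] * d[j] * f[k][j] for k, j in pairs)
    p_h_and_not_f = sum(a[k] * d[j] * f[k][j] * t[k] for k, j in pairs)
    return p_h, p_h_and_not_f / p_not_f


# Hypothetical values, chosen only to satisfy A3-A5.
a = [0.2, 0.3, 0.5]          # 0, 1 or 2 alternatives
d = [0.6, 0.4]               # two difficulty levels
f = [[1.0, 1.0],             # rows: non-increasing in k
     [0.5, 0.8],             # columns: non-decreasing in j
     [0.3, 0.7]]
t = [1.0, 0.5, 1.0 / 3.0]    # non-increasing in k

prior, posterior = naa_confirmation(a, d, f, t)
print(f"p(H) = {prior:.3f}, p(H | not F_A) = {posterior:.3f}")
```

With these values the posterior exceeds the prior, as the theorem predicts; other assignments that respect A3-A5 can be substituted to explore how large the increase is.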

In line with the introduction of A in section 4.1, we have assumed that 
A only takes values in the natural numbers. This might be seen as evading 
the skeptical argument that there may be infinitely many (theoretically 
adequate, empirically successful, ...) alternatives to T. Therefore we now 
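extend the theorem by explicitly allowing for the possibility
$A_\infty := \{A = \infty\}$, and we modify our assumptions accordingly. In particular,
we observe that A5 entails $p(A_\infty) < 1$, define
$f_{\infty j} := p(\neg F_A \mid A_\infty, D_j)$ and $t_\infty := p(H \mid A_\infty)$,
and demand that

$$f_{ij} \geq f_{\infty j} \;\; \forall i, j \in \mathbb{N} \quad \text{and} \quad f_{\infty i} \leq f_{\infty j} \;\; \forall i, j \in \mathbb{N} \text{ with } i < j \tag{4.6}$$

$$t_i \geq t_\infty \;\; \forall i \in \mathbb{N} \tag{4.7}$$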


These requirements naturally extend assumptions A3 and A4 to the case of 
infinitely many alternatives. Then, we obtain the following generalization 
of the NAA: 


Theorem 4.2 (Validity of the NAA, infinitely many alternatives) If A
takes values in $\mathbb{N} \cup \{\infty\}$ and assumptions A1 to A5 hold together with their
extensions (4.6) and (4.7), then $\neg F_A$ confirms H, that is, $p(H \mid \neg F_A) > p(H)$.


In other words, even if we concede to the skeptic that there may be 
infinitely many alternatives to T, she must still acknowledge the validity 
of NAA as long as her degrees of belief satisfy $p(A_\infty) < 1$. This is, to our
mind, a quite substantial and surprising result. For a long time, philoso- 
phy of science has focused on logical and probabilistic relations between 
theory and evidence and neglected other forms of theory confirmation. 
However, the above theorem demonstrates that non-empirical evidence (in 
our specific sense of the word) can raise our confidence in the empirical 
adequacy of a theory. 

Note that only a dogmatic skeptic who insists on $p(A_\infty) = 1$ can deny
the validity of NAA. Theorem 4.2 convinces anyone who does not want to 
commit herself with respect to the probability of (in)finitely many genuine 
alternatives to T. (Recall that theories do not count as distinct when they 
are just different realizations of the same mechanism or principle.) Con- 
vincing such a fair and non-committal skeptic is, to our mind, much more 
important than convincing a dogmatic who just denies our premises by 
insisting on $p(A_\infty) = 1$. That assumption might only be warranted if we
were interested in the truth of T rather than its empirical adequacy. 

We have seen that the NAA can be used in support of a proposed 
theory. The question remains, however, whether the resulting support is 
of significant strength and whether using the NAA in a specific situation 
is justified. To facilitate matters, we conduct this analysis for the finite case 
(Theorem 4.1); the infinite case is analogous. 
The Bayesian Network representation of NAA in Figure 4.2 suggests
that the NAA cannot easily obtain confirmatory significance without
supportive reasoning. According to Figure 4.2, $\neg F_A$ may confirm an instance
of D (limitations to the scientists’ abilities to solve difficult problems) as
well as an instance of A, such as limitations to the number of possible
theories. Writing $d_j := p(D_j)$, it is then easy to see that for all $l \in \mathbb{N}$,

$$p(D_l \mid \neg F_A) = \frac{p(D_l, \neg F_A)}{p(\neg F_A)} = \frac{d_l \sum_k a_k f_{kl}}{\sum_j d_j \sum_k a_k f_{kj}}$$

which may be greater than $p(D_l)$ for plausible assignments of parameter
values. To successfully apply NAA, one has to amend the qualitative claim 
shown above with a comparative claim, namely that $\neg F_A$ confirms T more
than the claim {D > K} (“the problem is just very difficult”) for some 
threshold K. But such a statement is sensitive to the specific parameter 
assignments as well as to the chosen confirmation measure—and therefore 
hard to prove in general. Applied to the political context (the TINA variant 
of NAA), this result means that the failure to find viable alternatives to 
a particular policy does indeed confirm that the chosen policy may be 
the best one, in the sense of confirmation as increase in firmness. But 
without additional assumptions, it would be invalid to conclude that the 
probability has increased substantially, let alone that we should now be 
confident (e.g., with a degree of belief greater than 1/2) that the chosen 
policy is the best one. Be this as it may, we would like to stress that even 
without such a comparative assessment, the validity of the NAA, that it 
raises the probability of T being empirically adequate, is a surprising and 
substantial philosophical result. 
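
The competing explanation ("the problem is just very difficult") can be made concrete with the same kind of toy model. The sketch below computes the posterior over difficulty levels after a failure to find alternatives, $p(D_l \mid \neg F_A)$, and compares it with the prior. The function name `posterior_difficulty` and all numbers are hypothetical assumptions for illustration only.

```python
def posterior_difficulty(a, d, f):
    """Return the list [p(D_l | not F_A)] for a finite version of the model.

    a[k] = p(A_k), d[l] = p(D_l), f[k][l] = p(not F_A | A_k, D_l).
    Uses A2 (D independent of A) to factor the joint prior.
    """
    joint = [d[l] * sum(a[k] * f[k][l] for k in range(len(a)))
             for l in range(len(d))]
    total = sum(joint)  # this is p(not F_A)
    return [x / total for x in joint]


# Hypothetical values satisfying A3-A5 (same shape as a 3 x 2 toy model).
a = [0.2, 0.3, 0.5]
d = [0.6, 0.4]               # d[1] is the "more difficult" level
f = [[1.0, 1.0],
     [0.5, 0.8],
     [0.3, 0.7]]

post_d = posterior_difficulty(a, d, f)
print("prior over D:    ", d)
print("posterior over D:", [round(x, 3) for x in post_d])
```

In this example the failure to find alternatives also raises the probability of the higher difficulty level, which is why a purely qualitative confirmation claim about H does not yet settle how much support the NAA provides.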

The NAA opens up some interesting philosophical perspectives. First,
Inference to the Best Explanation (Lipton, 2004; Douven, 2011) can, to a cer- 
tain extent, be explicated in terms of NAA. The fact that no other gen- 
uinely satisfactory explanation has been found corresponds to the failure 
to find alternatives in the NAA pattern. Then, the structure of the argu- 
ment is similar; only the interpretation changes from “empirically ade- 
quate” to “best/only satisfactory explanation”. The relevant propositional 
variables then read as follows: 


H The hypothesis T is the only satisfactory explanation of phenomenon $\mathcal{E}$.

¬H The hypothesis T is not the only satisfactory explanation of phenomenon $\mathcal{E}$.

$F_A$ The scientific community has found an alternative to T that explains $\mathcal{E}$.

$\neg F_A$ The scientific community has not yet found an alternative to T that
explains $\mathcal{E}$.


A then denotes the number of alternatives that explain $\mathcal{E}$. It is not difficult
to motivate analogues of A1-A5 for this interpretation of our propositional
variables, and to derive that $\neg F_A$ confirms H. We conjecture that some
Inferences to the Best Explanation in science are actually NAAs in disguise:
they take the failure of attempts to find an alternative as a reason to infer 
the truth or empirical adequacy of the only available hypothesis. 

Second, the reasoning scheme of the NAA is similar to eliminative 
induction in the style of Francis Bacon, or more recently, Arthur Conan 
Doyle’s figure Sherlock Holmes: “when you have eliminated the impossi- 
ble, whatever remains, however improbable, must be the truth”. Could we 
use the NAA as a case study for creating deeper links between Bayesian 
and eliminative inference (Earman, 1992; Forber, 2011)? 

Third, our theoretical analysis should be complemented by more case 
studies in science: from string theory as a classical application of the NAA 
(Dawid, 2006, 2009), and also from other disciplines where empirical
evidence is scarce, such as palaeontology, archaeology or anthropology.

Fourth and last, the relationship between NAA and the TINA argu- 
ment in public policy should be investigated more closely. Can the en- 
dorsement of a defended policy really be defended with a type of NAA? 
Or does “failure to find viable alternatives” mean something very differ- 
ent in the political context, invalidating the application of the NAA in that 
domain? 

In the next variation, we will address the issue of scientific realism 
which looms behind the NAA and related reasoning schemes. In par- 
ticular, we will show how the model underlying NAA, with an explicit 
probability distribution over the number of alternatives to T, can be used 
to develop a sophisticated Bayesian version of the No Miracles Argument 
(NMA). 
Proof of the Theorems


Proof of Theorem 4.1: $\neg F_A$ confirms H if and only if
$P(H \mid \neg F_A) - P(H) > 0$, that is, if and only if

$$\Delta := P(H, \neg F_A) - P(H)\,P(\neg F_A) > 0.$$


We now apply the theory of Bayesian Networks to the structure depicted
in Figure 4.2, using assumptions A1 ($H \perp\!\!\!\perp F_A \mid A$) and A2 ($D \perp\!\!\!\perp A$),
and write $\Delta$ as a double sum over pairs (i, k) with i < k. This sum is
strictly positive because of A3-A5 taken together: A3 entails that the
difference $(f_{ij} - f_{kj})$ is non-negative, A4 does the same for the
$(t_i - t_k)$, and A5 entails that these differences are strictly positive for
at least one pair (i, k). Hence, the entire double sum is strictly positive.
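
The intermediate computation behind the double sum can be sketched as follows. This is our reconstruction under the notation $d_j := p(D_j)$, $a_k := p(A_k)$, $f_{kj}$ and $t_k$ from A3-A5, not a verbatim reproduction of the original derivation; all sums run over the natural numbers.

```latex
\begin{align*}
P(\neg F_A) &= \sum_i \sum_j d_j\, a_i\, f_{ij}, \qquad
P(H) = \sum_k a_k\, t_k, \qquad
P(H, \neg F_A) = \sum_i \sum_j d_j\, a_i\, t_i\, f_{ij}, \\
\Delta &= \sum_i \sum_j \sum_k d_j\, a_i\, a_k\, f_{ij}\,(t_i - t_k)
   && \text{(using } \textstyle\sum_k a_k = 1\text{)} \\
&= \tfrac{1}{2} \sum_j d_j \sum_i \sum_k a_i\, a_k\,(f_{ij} - f_{kj})(t_i - t_k)
   && \text{(exchanging the roles of } i \text{ and } k\text{)} \\
&= \sum_j d_j \sum_i \sum_{k > i} a_i\, a_k\,(f_{ij} - f_{kj})(t_i - t_k).
\end{align*}
```

Every summand in the last line is a product of non-negative factors by A3 and A4. Strict positivity additionally requires that the index j singled out in A5 has $d_j > 0$, which we assume here.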


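Proof of Theorem 4.2: We perform essentially the same calculations as in
the proof of Theorem 4.1 and additionally include the possibility
$\{A = \infty\}$.² Defining $f_{\infty j} := P(\neg F_A \mid D_j, A_\infty)$ leads us to the equalities

$$P(\neg F_A) = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} d_j\, a_i\, f_{ij} + \sum_{j=0}^{\infty} d_j\, a_\infty\, f_{\infty j}$$

$$P(H) = \sum_{k=0}^{\infty} t_k\, a_k + t_\infty\, a_\infty$$

$$P(H, \neg F_A) = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} d_j\, a_i\, t_i\, f_{ij} + \sum_{j=0}^{\infty} d_j\, a_\infty\, t_\infty\, f_{\infty j}$$

from which it follows, using $\lim_{K \to \infty} \sum_{k=0}^{K} a_k = 1 - a_\infty$, that

$$P(\neg F_A)\,P(H) = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} d_j\, a_i\, a_k\, t_k\, f_{ij} + \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} t_\infty\, d_j\, a_\infty\, a_i\, f_{ij} + \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} d_j\, a_\infty\, a_k\, t_k\, f_{\infty j} + \sum_{j=0}^{\infty} d_j\, a_\infty^2\, t_\infty\, f_{\infty j}$$

$$P(\neg F_A, H) = \frac{1}{1 - a_\infty} \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} d_j\, a_i\, a_k\, t_i\, f_{ij} + \sum_{j=0}^{\infty} d_j\, a_\infty\, t_\infty\, f_{\infty j}$$

With the definition

$$\Delta := \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} d_j\, a_i\, a_k\, t_i\, f_{ij} - \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} d_j\, a_i\, a_k\, t_k\, f_{ij}$$

we observe that $\Delta > 0$, as shown above in the proof of Theorem 4.1 (the
parameter values satisfy the relevant conditions A3-A5). Noting that A5
requires $a_\infty < 1$, it follows that

$$P(\neg F_A, H) - P(H)\,P(\neg F_A) > 0$$

since the extensions of A3 and A4 imply $f_{ij} \geq f_{\infty j}$ and $t_j \geq t_\infty$ (equations
(4.6) and (4.7)), independent of the values of i and j.

² The notation suggests that $\infty$ is already included in the summation index, but the
infinity sign on top of the sum is just the shortcut for the limit of the sequence of all
natural numbers. Thus, the case $A = \infty$ has to be treated separately.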
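
One way to see why the extensions (4.6) and (4.7) suffice is to collect all terms that involve $a_\infty := p(A_\infty)$. The following identity is our own sketch, obtained by expanding the products of the sums above and using $\sum_{i} a_i = 1 - a_\infty$; it is not taken from the original proof, and all sums run over the natural numbers only.

```latex
\begin{align*}
P(\neg F_A, H) - P(H)\,P(\neg F_A)
 &= \Delta + a_\infty \sum_{i=0}^{\infty} \sum_{j=0}^{\infty}
    d_j\, a_i\,(t_i - t_\infty)(f_{ij} - f_{\infty j}), \\
\Delta &:= \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \sum_{k=0}^{\infty}
    d_j\, a_i\, a_k\, f_{ij}\,(t_i - t_k).
\end{align*}
```

The first term is strictly positive by the argument for Theorem 4.1, and the second term is non-negative because $t_i \geq t_\infty$ and $f_{ij} \geq f_{\infty j}$ for all i and j.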
Variation 5: Scientific Realism and the
No Miracles Argument 


The debate between scientific realists and anti-realists is one of the classics 
of philosophy of science, comparable to a soccer match between Brazil 
and Argentina. Realism comes in different varieties, e.g., metaphysical, 
semantic and epistemological realism (see Chakravartty, 2011, for a sur- 
vey). The most ambitious and most contested form of realism is the 
epistemological thesis that we are justified to believe in the truth of our 
best scientific theories, and that they constitute knowledge of the external 
world (Boyd, 1983; Psillos, 1999, 2009). In this view, the existence of a 
mind-independent world (metaphysical realism) and the reference of the- 
oretical terms to mind-independent entities (semantic realism) is usually 
presupposed—the real question concerns the epistemic status of our best 
scientific theories. 


In this variation, we demonstrate how Bayesian methods can be used 
for clarifying and sharpening the debate between realists and anti-realists. 
It consists of three parts. In the first part, we explain the No Miracles 
Argument (NMA) and examine the well-known objection that the real- 
ist commits the base rate fallacy (Howson, 2000; Magnus and Callender, 
2004). The second and third part respond to this issue. In the second part, 
we demonstrate how observed stability of scientific theories can make a 
case for scientific realism and we argue that the validity of the NMA cru- 
cially depends on the specific context of a scientific discipline. In the third 
part, we show how shifting from individual theories to a series of theories 
in a scientific discipline may alleviate the base rate fallacy. Both arguments 
are refinements of the NMA within a Bayesian model. Note that none 
of our arguments makes a sufficient case for scientific realism: the scope 
of the probabilistic NMA does not extend beyond claims to empirical ad- 
equacy. However, the NMA is necessary for parrying vital threats to the 
realist view, such as Laudan’s argument from Pessimistic Meta-Induction.

We do not argue that Bayesian philosophy of science should be aligned 
with a realist stance. Bayesian philosophy of science is a methodological 
approach to investigating substantial philosophical questions, not a spe- 
cific answer to them. Rather, we show how Bayesian methods can be used 
to clearly articulate the realist argument, to investigate its validity and to 
determine its scope. 


The Probabilistic No Miracles Argument 


A major player in the debate about scientific realism is the No Miracles 
Argument (NMA). It contends that the truth of our best scientific theo- 
ries is the only hypothesis that does not make the astonishing predictive, 
retrodictive and explanatory success of science a mystery (Putnam, 1975, 
73). If our best scientific theories did not correctly describe the world, 
why should we expect them to be successful at all? The truth of our best 
theories is an excellent, and perhaps the only explanation of their suc- 
cess. Therefore, we should accept the realist hypothesis: our best scientific 
theories are true and constitute knowledge of the world. 

It is not entirely clear whether the NMA is an empirical or a super- 
empirical argument. As an argument from past and present success of 
our best scientific theories to their truth, it involves two major steps: the 
step from observed success to justified belief in empirical adequacy, and 
the step from justified belief in empirical adequacy to justified belief in 
truth (see Figure 5.1). The first of them is an empirical inference, the sec- 
ond most probably not: ordinary empirical evidence cannot distinguish 
between different theoretical structures that yield the same observable con- 
sequences. 

Much philosophical discussion has been devoted to the second step of 
the NMA (e.g., Psillos, 1999; Lipton, 2004; Stanford, 2006), which seems 
in greater need of a philosophical defense. After all, the realist has to 
address the problem of underdetermination of theory by evidence. But 
also the first step of the NMA is far from trivial, and strengthening it 
against criticism is vital for the scientific realist. For instance, Laudan 
(1981) has argued that there have been lots of successful, but non-referring 
(and empirically inadequate) scientific theories. If Laudan were right, then 
the entire NMA would break down, even if objections to the second step
could be answered successfully.

Figure 5.1: The structure of the NMA as a two-step argument from the
empirical success of T to its truth. We conceptualize the NMA as an
argument for the first inference in this figure, that is, for an inference
from empirical success of T to its empirical adequacy.

Such arguments do not only threaten full scientific realists, but also 
structural realists (Worrall, 1989) and some varieties of anti-realism that 
make substantial epistemic commitments. One of them is Bas van 
Fraassen’s constructive empiricism (van Fraassen, 1980; Monton and 
Mohler, 2012). Proponents of this view deny that we have reasons to 
believe that our best scientific theories are literally true. However, they 
affirm that we are justified to believe in their observable parts. Thus they 
are also affected by criticism and defense of the first step of the NMA. 

Hence, the first step of the NMA does not draw a sharp divide between 
realists and anti-realists. Rather, the debate takes place between those who 
derive epistemic commitments from the success of science, and those who 
deny them. This variation is devoted to exploring the question of whether 
such epistemic commitments are justified. For convenience, we stick to 
the traditional terminology and refer to the first group as “realists” and to 
the second group as “anti-realists”. We begin with a Bayesian analysis of 
a simple form of the NMA. 

We first apply the NMA to a particular scientific theory T which is pre- 
dictively and explanatorily successful in a certain scientific domain. Since 
we only investigate arguments for the empirical adequacy of T, we intro- 
duce a propositional variable H—the hypothesis that T is empirically ad- 
equate. See Figure 5.2 for a simple Bayesian network representation of the 
dependence between H and the propositional variable S that represents 
the empirical success of T. 
Figure 5.2: The Bayesian Network representation of the impact of H (the
empirical adequacy of theory T) on the empirical success of T, denoted by
S.


Expressed as a Bayesian argument, the simple NMA then runs as fol- 
lows: S is much more probable if T is empirically adequate than if it is not. 
This can be expressed by the following two assumptions: 


A1 s := p(S|H) is large.

A2 s’ := p(S|¬H) < k < 1.

From Bayes’ Theorem, we can then infer

p(H|S) > p(H).


In other words, S confirms H: our degree of belief in the empirical ade- 
quacy of T is increased if T is successful. 

Anti-realists object to the above argument that the inequality p(H|S) > 
p(H) falls short of establishing the first step of the NMA. We are primar- 
ily interested in whether H is sufficiently probable given S, not in whether 
S raises our degree of belief in H. After all, the increase in probability 
could be negligibly small. The result p(H|S) > p(H) does not establish 
that p(H|S) > K for a critical threshold K, e.g., K = 1/2. We already 
know this distinction between posterior probability and incremental con- 
firmation from Variation 2, under the name of confirmation as firmness 
vs. confirmation as increase in firmness. 

More specifically, it has been argued that the NMA commits the base 
rate fallacy (Howson, 2000; Magnus and Callender, 2004). This is an un- 
warranted type of inference that frequently occurs in medicine. Consider 
a highly sensitive medical test which yields a positive result. On the other 
hand, the medical condition in question is very rare, that is, the base rate 
of the disease is very low. In such a case, the posterior probability of the 
patient having the disease, given the test, will still be quite low. Nonethe- 
less, medical practitioners tend to disregard the low base rate and to infer 
that the patient really has the disease in question (e.g., Goodman, 1999a). 
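
The medical example can be made concrete with a few lines of arithmetic. The Python sketch below applies Bayes' Theorem to a test result; the function name `posterior_given_positive` and the numbers (base rate 0.1%, sensitivity 99%, false positive rate 5%) are hypothetical values chosen for illustration, not data from the cited literature.

```python
def posterior_given_positive(base_rate, sensitivity, false_positive_rate):
    """Posterior probability of disease given a positive test (Bayes' Theorem)."""
    p_positive = (sensitivity * base_rate
                  + false_positive_rate * (1.0 - base_rate))
    return sensitivity * base_rate / p_positive


# Hypothetical numbers: a rare condition and a highly sensitive test.
base_rate = 0.001
post = posterior_given_positive(base_rate, sensitivity=0.99,
                                false_positive_rate=0.05)
print(f"prior = {base_rate:.3f}, posterior = {post:.3f}")
```

The positive result raises the probability of disease considerably relative to the base rate, yet the posterior stays far below 1/2. Ignoring the base rate amounts to confusing these two facts.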

This objection can be elucidated by a brief inspection of Bayes’ The- 
orem. Our quantity of interest is the posterior probability p(H|S), our 
confidence in H given S. This quantity can be written as

$$p(H \mid S) = \frac{p(H)\, p(S \mid H)}{p(S)} = \left(1 + \frac{1 - p(H)}{p(H)} \cdot \frac{p(S \mid \neg H)}{p(S \mid H)}\right)^{-1} \tag{5.1}$$

which shows that p(H|S) is not only an increasing function in p(S|H) and
a decreasing function in p(S|¬H): its value crucially depends on the base
rate or prior plausibility of H, p(H).

Anti-realists claim that NMA is built on a base rate fallacy: from the 
high value of p(S|H) (“the empirical adequacy of T explains its success”)
and the low value of p(S|¬H) (“success of T would be a miracle if T were
not empirically adequate”), justified belief in H (“T is empirically ade- 
quate”) is inferred. The probabilistic model of the NMA demonstrates 
that we need additional assumptions about p(H) to warrant this infer- 
ence. In the absence of such assumptions, the NMA does not entitle us to 
accept T as empirically adequate. 

What do these considerations show? First, they expose that the NMA, 
reconstructed as a probabilistic inference to the posterior probability of H, 
is essentially subjective. After all, any weight of evidence in favor of H can 
be counterbalanced by a sufficiently skeptical prior, that is, a sufficiently 
low value assigned to p(H). The realist needs to provide convincing rea- 
sons why p(H) should not be arbitrarily close to zero, and such reasons 
will typically presuppose realist inclinations. This is a problem for those 
realists who claim that the NMA is an intersubjectively compelling argu- 
ment in favor of scientific realism. Howson (2013, 211) concludes that due 
to the dependence on unconstrained prior degrees of belief, the NMA is, 
“as a supposedly objective argument, [...] dead in the water”. See also 
Howson (2000, ch. 3), Lipton (2004, 196-198), and Chakravartty (2011). 

What is more, a low value for our prior degree of belief in H may 
be more rational than a high value. Take, for example, Larry Laudan’s 
argument from Pessimistic Meta-Induction (PMI): “I believe that for every 
highly successful theory in the past of science which we now believe to 
be a genuinely referring theory, one could find half a dozen successful 
theories which we now regard as substantially non-referring” (Laudan, 
1981, 35). Why should our currently best theory $T_n = T$ not suffer the
same fate as its predecessors $T_1, \ldots, T_{n-1}$, which proved to be empirically
inadequate although they were once the best scientific theory?

PMI affects the values of p(S|¬H) and p(H) as follows: On the one
hand, history teaches us that there have often been false theories that
explained the data well (and were superseded later). In other words,
empirically inadequate theories can be highly successful and p(S|¬H) need not
be low. On the other hand, PMI advises a low degree of belief that T is 
empirically adequate since in the past, comparable theories turned out to 
be false. 

To substantiate these concerns, we conduct a numerical analysis of 
the probabilistic NMA. For the sake of simplicity, let us sharpen A1 to
s = p(S|H) = 1: if theory T is empirically adequate, then it is also
successful. Furthermore, define s’ := p(S|¬H) and let h := p(H) be the prior
probability of H. We now ask the question: for which values of s’ and h 
is the posterior probability of H, p(H|S), greater than 1/2? That is, when 
would it be more plausible to believe that T is empirically adequate than 
to deny it? Satisfying this condition is arguably a minimal requirement for 
the claim that the success of T entitles us to justified belief in its empirical 
adequacy. 

By using Bayes’ Theorem, we can easily calculate when the inequality 
p(H|S) > 1/2 is satisfied. Equation (5.1) brings us to the inequality 


$$\left(1 + \frac{1 - h}{h}\, s'\right)^{-1} > \frac{1}{2},$$

which can be written as

$$s' < \frac{h}{1 - h}. \tag{5.2}$$

See Figure 5.3 for a graphical illustration.

Figure 5.3: The scope of the No Miracles Argument, represented graphically.
p(H|S) > 1/2 is the case in the white area below the line.

However, inequality (5.2) is not easy to satisfy. As mentioned above,
false theories and models often make accurate predictions and perform
well on other cognitive values (see Frigg and Hartmann, 2012, for an
overview). Classical examples that are still used today involve Newtonian
mechanics, the Lotka-Volterra model from population biology (e.g.,
Weisberg, 2007) and Rational Choice Theory. Hence, the value of s’ = p(S|¬H)
should not be too low. But if we choose, for example, s’ = 1/4, then
inequality (5.2) requires p(H) > 1/5 to make the NMA work! In other
words, the NMA only works for theories which
are already likely to be empirically adequate. What is more, for a mildly
skeptical prior such as p(H) = 0.05, the value of s’ would have to be in
the range [0, 0.053]. This amounts to making the assumption that only the
empirical adequacy (or truth) of a scientific theory can explain its success.
But this is essentially a realist premise which the anti-realist would refuse
to accept. She could point to the existence of unconceived alternatives
(Stanford, 2006, ch. 6), the explanatory successes of false theories, etc. In
other words: the simple probabilistic model of the NMA demonstrates
that (1) to the extent that the NMA is valid, its premises presuppose
realist inclinations; (2) to the extent that the NMA builds on premises that are
neutral between the realist and the anti-realist, it fails to be valid.
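
The numerical analysis of the simple NMA is easy to reproduce. The sketch below evaluates the posterior from Equation (5.1) with h := p(H) and s’ := p(S|¬H), and the bound on s’ that results for s = 1. The function names `posterior_h` and `max_s_prime` are our own, introduced for illustration.

```python
def posterior_h(h, s_prime, s=1.0):
    """p(H | S) according to Equation (5.1), for a prior 0 < h <= 1."""
    return 1.0 / (1.0 + (1.0 - h) / h * s_prime / s)


def max_s_prime(h):
    """Upper bound on s' such that p(H | S) > 1/2 when s = 1 and 0 < h < 1."""
    return h / (1.0 - h)


# With s' = 1/4 the posterior reaches exactly 1/2 at h = 1/5.
print(posterior_h(0.2, 0.25))
# A mildly skeptical prior leaves very little room for s'.
print(round(max_s_prime(0.05), 3))
```

The bound h / (1 - h) grows slowly for small h, which is the formal counterpart of the observation that a skeptical prior forces a near-miraculous reading of success without empirical adequacy.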


Are things thus hopeless for the realist who wants to convince the 
anti-realist that the NMA is a good argument? Does “all realistic hope 
of resuscitating the [no miracles] argument [fail]”, as Howson (2013, 211) 
writes? Perhaps not necessarily so. So far, the probabilistic NMA only 
took into account the predictive and explanatory success of T. Now we 
also consider the stability of scientific theories as evidence for scientific 
realism. This move is related to the No Alternatives Argument (NAA,
see Variation 4), and in fact, our probabilistic model will be inspired by
NAA-type reasoning. 
Extending the No Miracles Argument to Stable Scientific Theories


Recently, Ludwig Fahrbach (2009, 2011) has argued that the stability of 
major scientific theories in the second half of the 20th century provides 
a strong argument in favor of scientific realism. In this section, we show 
how observing theoretical stability in a scientific discipline could give a 
boost to the probabilistic NMA. 


Fahrbach’s argument is mainly based on scientometric data. He ob- 
serves an exponential growth of scientific activity in the 20th century, with 
a doubling of scientific output every 20 years (Meadows, 1974). He also 
notes that at least 80% of all scientific work has been done since the year 
1950 and observes that our best scientific theories (e.g., the periodic table of 
elements, optical and acoustic theories, the theory of evolution, etc.) were 
stable during that period of time. That is, they did not undergo rejection 
or major conceptual change. On the other hand, Laudan’s examples in 
favor of PMI stem from the early periods of science, e.g., the caloric theory 
of heat, the ether theory in physics, or the humoral theory in medicine. 
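
The scientometric claim rests on simple arithmetic about exponential growth. The sketch below computes the share of cumulative scientific output produced after a given year, assuming that output has grown exponentially with a fixed doubling time since the indefinite past. The function name `share_of_work_since` and the reference year 2010 are our assumptions for illustration; only the 20-year doubling time is taken from the text.

```python
def share_of_work_since(year, now, doubling_time=20.0):
    """Fraction of all work done up to `now` that was produced after `year`.

    Assumes cumulative output proportional to 2 ** (t / doubling_time).
    """
    return 1.0 - 2.0 ** (-(now - year) / doubling_time)


# Three doublings between 1950 and the (assumed) reference year 2010.
print(share_of_work_since(1950, 2010))
```

Under these assumptions, three doubling periods already account for the bulk of all scientific work, which is in line with the figure of at least 80% cited above.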

For giving a fair assessment of PMI, we have to take into account the 
amount of scientific work done in a particular period. This implies, for ex- 
ample, that the period 1800-1820 should receive much less weight than the 
period 1950-1970. According to Fahrbach, PMI fails because most “theory 
changes occurred during the time of the first 5% of all scientific work ever 
done by scientists” (Fahrbach, 2011, 149). If PMI were valid, we should 
have observed more substantial theory changes or scientific revolutions in 
the recent past. However, although the theories of modern science often 
encounter difficulties, revolutionary turnovers do not (or only very rarely) 
happen. According to Fahrbach, PMI stands refuted—or at the very least, 
it is not more rational than an optimistic meta-induction. 

Certainly, Fahrbach’s model is quite simplified. For example, the num- 
ber of published papers in a discipline need not be a reliable indicator of 
the probative value of a scientific theory. However, we are not interested 
in whether Fahrbach sketches an accurate picture of 20th century science. 
Rather, we will use a Bayesian framework for showing that such observa- 
tions can in principle support the realist thesis. More precisely, we explore 
if observations of long-term stability expand the scope of the NMA. To 
this end, we refine our probabilistic model from the previous section. 
As before, the propositional variable H expresses the empirical adequacy
of theory T, and S denotes the predictive, retrodictive and explanatory
success of T. The integer-valued random variable A expresses the
number of satisfactory alternatives to T, and $A_j$ is our shortcut for the
proposition A = j. As in the previous variation on the NAA, we demand
that genuine alternatives satisfy a set of (context-dependent) theoretical
constraints $\mathcal{C}$, be consistent with the currently available data $\mathcal{D}$, and
give distinguishable predictions for the outcome of some set $\mathcal{E}$ of future
experiments. In line with our focus on empirical adequacy rather than
truth, we do not distinguish between empirically equivalent theories with
different theoretical structures. Finally, major theory change in the domain
of T is denoted by C, and absence of change and theoretical stability by
¬C. “Theory change” is understood in a broad sense, including scenarios
where rivalling theories emerge and end up co-existing with T.

The dependency between these four propositional variables—A, C, H 
and S—is given by the Bayesian network in Figure 5.4. S, the success of 
theory T, only depends on the empirical adequacy of T, that is, on H. The 
probability of H depends on the number of distinct alternatives that are 
also consistent with the current data, etc. Finally, C, the probability of 
observing substantial theory change, depends on S and A: the empirical 
success of T and the number of available alternatives. To rule out
preservation of a theory by a series of degenerative accommodating moves, the
variable C should be evaluated over a longer period (e.g., 30-50 years).


Figure 5.4: The Bayesian Network representation of the relation between
variables A (the number of alternatives to T), H (empirical adequacy of
theory T), S (success of T) and C (major theory change).


We now define a number of real-valued variables in order to facilitate 
calculations: 
• Denote by $a_j := p(A_j)$ the probability that there are exactly j
alternatives to T that satisfy the theoretical constraints $\mathcal{C}$, are consistent with
current data $\mathcal{D}$ and give definite predictions for future experiments
$\mathcal{E}$, etc.


• Denote by $h_j := p(H \mid A_j)$ the probability that T is empirically
adequate if there are exactly j alternatives to T.


• As before, denote by s := p(S|H) and s’ := p(S|¬H) the probability
that T is successful if it is (not) empirically adequate.


• Denote by $c_j := p(\neg C \mid A_j, S)$ the probability that no substantial
theory change occurs if T is successful and there are exactly j
alternatives to T.


Suppose that we now observe ¬C (no substantial theory change has
occurred in the last decades) and S (theory T is successful). The Bayesian 
network structure allows for a simple calculation of the posterior proba- 
bility of H. 


Proposition 5.1 The posterior probability of H given ¬C and S is given by

$$p(H \mid \neg C, S) = \frac{\sum_{j=0}^{\infty} a_j\, c_j\, s\, h_j}{\sum_{j=0}^{\infty} a_j\, c_j \left(s\, h_j + s'(1 - h_j)\right)} \tag{5.3}$$


We now make some assumptions on the values of these quantities. 


B0 The variables A, C, H and S satisfy the (conditional) independencies
in the Bayesian Network structure of Figure 5.4.


B1 If T is empirically adequate then it will be successful in the long run:
p(S|H) = 1.


B2 The empirical adequacy of T is no more or less probable than the
empirical adequacy of an alternative which satisfies the same set of
theoretical and empirical constraints: $h_j := p(H \mid A_j) = 1/(j+1)$. In
other words, there is no “actualist bias” in favor of T.


B3 The more satisfactory alternatives exist, the less likely is an extended
period of theoretical stability. In other words, $c_j := p(\neg C \mid A_j, S)$ is a
decreasing function of j. For convenience, we choose $c_j := 1/(j+1)$.
(This particular assignment will be relaxed later on.)
B4 Assume that T is our currently best theory and we happen to find a
satisfactory alternative T’. Then, the probability of finding another
alternative T’’ is the same as the probability of finding T’ in the first
place. Formally:

$$p\Big(A > j \,\Big|\, \bigvee_{k=j}^{\infty} A_k\Big) = p\Big(A > j+1 \,\Big|\, \bigvee_{k=j+1}^{\infty} A_k\Big) \quad \forall j \geq 0. \tag{5.4}$$

In other words, Equation (5.4) expresses the idea that finding an
alternative does not, in itself, raise or lower the probability of finding
another alternative.


Note that B0–B4 are equally plausible for the realist and the anti-realist.
In other words, no realist bias has been incorporated into the assumptions. 
We can now show the following proposition (proof in the appendix): 


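Proposition 5.2 From Equation (5.4) it follows that $a_j = a_0 (1 - a_0)^j$.

Together with this proposition, B0–B4 allow us to rewrite Equation
(5.3) as follows:

\[
p(H|\neg C S) = \frac{\sum_{j=0}^{\infty} \frac{(1 - a_0)^j}{(j+1)^2}}{\sum_{j=0}^{\infty} \frac{(1 - a_0)^j}{(j+1)^2}\, (1 + j\, s')} \tag{5.5}
\]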


With the help of this formula, we can now rehearse the NMA once
more and determine its scope, that is, those parameter values where
$p(H|\neg C S) > 1/2$. The two relevant parameters are $a_0$, the prior
probability that there are no satisfactory alternatives to T, and $s'$, the
probability that T is successful although not empirically adequate. Since
an analytical solution of Equation (5.5) is not feasible, we conduct a
numerical analysis. Results are plotted in Figure 5.5.
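
A numerical analysis of this kind can be sketched in a few lines of Python. The sketch is ours and is not the code behind Figure 5.5: it assumes B1 ($s = 1$), B2 ($h_j = 1/(j+1)$), B3 with $c_j = 1/(j+1)$ and the geometric prior $a_j = a_0(1 - a_0)^j$ from Proposition 5.2, and it approximates the infinite sums in the posterior formula of Proposition 5.1 by truncating them. The function name `nma_posterior` and the cut-off `n_terms` are hypothetical choices made for illustration.

```python
def nma_posterior(a0, s_prime, n_terms=20000):
    """Approximate p(H | not-C, S) under B0-B4 by truncating the series.

    Assumes 0 < a0 <= 1 and 0 <= s_prime <= 1. The cut-off n_terms is an
    illustrative choice; smaller a0 needs more terms for the same accuracy.
    """
    numerator = 0.0    # sum over j of a_j * c_j * s * h_j
    denominator = 0.0  # sum over j of a_j * c_j * (s*h_j + s'*(1 - h_j))
    s = 1.0            # assumption B1
    for j in range(n_terms):
        a_j = a0 * (1.0 - a0) ** j   # geometric prior (Proposition 5.2)
        h_j = 1.0 / (j + 1)          # assumption B2
        c_j = 1.0 / (j + 1)          # assumption B3, convenient choice
        numerator += a_j * c_j * s * h_j
        denominator += a_j * c_j * (s * h_j + s_prime * (1.0 - h_j))
    return numerator / denominator


if __name__ == "__main__":
    # Scan a small grid of (a0, s') values, as one would for a surface plot.
    for a0 in (0.01, 0.05, 0.1, 0.3):
        row = [round(nma_posterior(a0, sp), 3) for sp in (0.01, 0.1, 0.5, 1.0)]
        print(a0, row)
```

Plotting the returned values over a grid of $a_0$ and $s'$ and comparing them with the plane $z = 1/2$ reproduces the kind of comparison described in the text. A different choice of $c_j$ can be explored by changing the single line that defines `c_j`.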

These results are very different from the ones in the previous section.
With the hyperplane z = 0.5 dividing the graph into a region where T
may be accepted and a region where this is not the case, we see that the
scope of the NMA has increased substantially compared to Figure 5.3. For
instance, $a_0 = p(H) > 0.1$ suffices for a posterior probability greater
than 1/2, almost regardless of the value of $s'$. This is a striking
difference from the previous analysis, where far more optimistic values had
to be assumed in order to make the NMA work.

Figure 5.5: The scope of the No Miracles Argument in the revised formulation.
The posterior probability of H, $p(H|\neg C S)$, is plotted as a function of
(1) the prior probability that T is empirically adequate ($a_0$); (2) the
probability that T is successful if T is false ($s' = p(S|\neg H)$). The
hyperplane z = 1/2 is inserted in order to show for which parameter values
$p(H|\neg C S)$ is greater than 1/2.

So far, the analysis has been conducted in terms of absolute confirmation,
that is, the posterior probability of H. We now complement it by an
analysis in terms of confirmation as increase in firmness. That is, we
calculate the evidential support that $\neg C S$ confers on H. We use the
log-likelihood measure $l(H, E) = \log_2 p(E|H)/p(E|\neg H)$ which has a
good reputation in confirmation theory (→ Variation 2) and a firm standing
in scientific practice (e.g., Royall, 1997; Good, 2009). Also, it is a
confirmation measure that describes the discriminative power of the
evidence with respect to the realist and the anti-realist hypothesis and
that is relatively insensitive to prior probabilities. The necessary
calculations can be found in the final section of this variation.

In Figure 5.6, we have plotted the degree of confirmation as a function
of the value of $s'$, for three different values of $a_0$, namely 0.01,
0.05 and 0.1. As visible from the graph, the (logarithmic) degree of
confirmation is substantial for all three cases, even for large values of
$s'$. In particular, it is robust vis-à-vis the values of $a_0$ and $s'$
and able to withstand the anti-realist argument that plagued the original
version of the NMA. Note that if $s'$ is small, as will often be the case
in practice, the logarithmic (!) degree of confirmation comes close to 10,
which corresponds to a likelihood ratio of more than 1,000! And even if an
anti-realist insists that $s' \approx 0.2$ (not a very plausible
assumption), the likelihood ratio hovers in the range between 15 and 30.
This finding accounts for the realist intuition that the stability of
scientific theories over time, together with their empirical success, is
strong evidence for their empirical adequacy.

Figure 5.6: The degree of confirmation
$l(H, \neg C S) = \log_2 p(\neg C S|H)/p(\neg C S|\neg H)$, for three
different values of $a_0$. Full line: $a_0 = 0.01$. Dashed line:
$a_0 = 0.05$. Dot-dashed line: $a_0 = 0.1$.
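
The degree of confirmation can be approximated numerically in the same spirit. The following Python sketch is our own illustration, not the computation behind Figure 5.6. It assumes B1–B3 with $c_j = h_j = 1/(j+1)$ and $s = 1$, the geometric prior $a_j = a_0(1 - a_0)^j$, and a truncation of the infinite sums; the name `log_likelihood_ratio` and the cut-off `n_terms` are hypothetical.

```python
import math


def log_likelihood_ratio(a0, s_prime, n_terms=20000):
    """Approximate l(H, not-C & S) = log2[p(not-C,S | H) / p(not-C,S | not-H)].

    Assumes 0 < a0 < 1 and 0 < s_prime <= 1; sums are truncated at n_terms.
    """
    joint_h = 0.0      # p(not-C, S, H)     = sum a_j c_j s h_j, with s = 1
    prior_h = 0.0      # p(H)               = sum a_j h_j
    joint_not_h = 0.0  # p(not-C, S, not-H) = sum a_j c_j s' (1 - h_j)
    prior_not_h = 0.0  # p(not-H)           = sum a_j (1 - h_j)
    for j in range(n_terms):
        a_j = a0 * (1.0 - a0) ** j   # geometric prior (Proposition 5.2)
        h_j = 1.0 / (j + 1)          # assumption B2
        c_j = 1.0 / (j + 1)          # assumption B3, convenient choice
        joint_h += a_j * c_j * h_j
        prior_h += a_j * h_j
        joint_not_h += a_j * c_j * s_prime * (1.0 - h_j)
        prior_not_h += a_j * (1.0 - h_j)
    likelihood_h = joint_h / prior_h
    likelihood_not_h = joint_not_h / prior_not_h
    return math.log2(likelihood_h / likelihood_not_h)


if __name__ == "__main__":
    for a0 in (0.01, 0.05, 0.1):
        print(a0, [round(log_likelihood_ratio(a0, sp), 2) for sp in (0.01, 0.1, 0.2, 0.5)])
```

Under these assumptions $s'$ enters the likelihood ratio only as a factor $1/s'$, so halving $s'$ raises the logarithmic degree of confirmation by exactly one unit.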

Finally, we relax our assumptions B0–B4. Qualitatively, our results
do not change if we replace B1 with the more cautious formulation
$p(S|H) = 1 - \varepsilon$. More interesting is a robustness analysis
regarding the explication of B3. Arguably, the function
$c_j := p(\neg C|A_j, S) = 1/(j+1)$ suggests that scientists are quite
ready to give up on their currently best theory in favor of a good
alternative. But as many philosophers and historians of science have
argued (e.g., Kuhn, 1977b), scientists may be more conservative and
continue to work in the standard framework, even if good alternatives
exist. Therefore we also analyze a different choice of the $c_j$, namely
$c_j := e^{-\frac{1}{2}(j/\alpha)^2}$, where $c_j$ falls more gently in $j$.
This choice can then be plugged into Equation (5.3), yielding values of
$p(H|\neg C S)$ that are different from the ones in Equation (5.5).

The corresponding graph of $p(H|\neg C S)$, as a function of $a_0$ and $s'$,
is presented in Figure 5.7. We have set $\alpha = 4$, corresponding to a
high degree of reluctance to reject the currently best theory. Yet, the
results match those from Figure 5.5: the scope of the NMA is much larger
than in the simple version of the probabilistic NMA. Hence, our findings
seem to be robust toward different choices of $c_j$.


Figure 5.7: The scope of the No Miracles Argument in the revised formulation,
with $c_j := e^{-\frac{1}{2}(j/\alpha)^2}$. The posterior probability of H,
$p(H|\neg C S)$, is plotted as a function of $a_0$ and $s'$, like in
Figure 5.5, and contrasted with the hyperplane z = 1/2.


All in all, our model shows that a probabilistic NMA need not be
doomed. Its validity depends crucially on the disciplinary context in
which it operates. What are our expectations regarding the invention of
satisfactory alternatives to T? Has the discipline been in a long period of
theoretical stability? And so on. Of course, our model makes simplifying
assumptions, but unlike the assumptions in the original model, they
do not carry a realist bias. This allows for a more nuanced and
context-sensitive assessment of the realist argument. The first step of
the NMA is valid when theories are stable and the discipline allows for few
potential explanations of observed phenomena. Anti-realist objections are
supported by case studies where scientific theories have been volatile or
one of our assumptions B0–B4 is implausible. The probabilistic
reconstruction of the NMA can thus explain and guide the strategies that
realists and anti-realists pursue when defending their positions.

We would like to stress that the context-sensitivity of the NMA is not a 
vice, but a virtue. It explains why realists and anti-realists often talk past 
each other, and it sketches a fruitful research program for future case
studies. In particular, more research is needed into which areas of science
have been theoretically stable, and whether the kind of stability that Fahrbach
cites is genuine or based on a superficial continuity that hides substantial 
meaning changes. We now proceed to another variation of the NMA: the 
frequency-based No Miracles Argument. 


The Frequency-Based No Miracles Argument 


In the above analysis, the empirical adequacy of a particular theory T
explained why T is predictively successful. We shall call an NMA that tries
to derive the empirical adequacy of theory T from its predictive success
an individual-theory-based NMA. However, there is another way of
conceptualizing the NMA. Following this second understanding, what is to
be explained by the realist conjecture is not the empirical adequacy of a
particular theory (e.g., the Standard Model of modern physics), but the
tendency of theories in mature science to be empirically adequate. In this
version, the NMA primarily relies on observed characteristics of science
as a whole, or of a specific segment of science. Theories that are part of
that segment, such as theories that are part of mature science or that are
part of a specific mature research field, are expected to have a high rate of
being empirically adequate. We will call an NMA based on the frequency
of predictive success a frequency-based NMA. This is actually the type of
NMA that Hilary Putnam put forward in his famous first formulation of
the NMA:


The positive argument for realism is that it is the only phi- 
losophy that does not make the success of science a miracle. 
That terms in mature science typically refer [...], that the the- 
ories in a mature science are typically true, that the same term 
can refer to the same thing even when it occurs in different 
theories—these statements are viewed by the scientific realist 
not as necessary truths but as the only scientific explanation of 
the success of science and hence as part of any adequate scientific 
description of science and its relations to its objects. (Putnam, 
1975, our emphasis) 


Note that Putnam speaks of the success of science rather than of the suc- 
cess of an individual scientific theory. He clearly understands the success 
of science as a general and observable phenomenon. Since he obviously
would not want to say that each and every scientific theory is always
predictively successful, he asserts that we find a high success rate of scientific
theories based on our observations of the history of (mature) science. He 
then infers from the success of science that mature scientific theories are 
typically approximately true, or at least empirically adequate. 

Another early main exponent of the NMA, Richard Boyd (1981, 1983, 
1984), is committed to the frequency-based NMA as well. Boyd emphasizes
that only what he calls the “predictive reliability of well-confirmed scien- 
tific theories” and the “reliability of scientific methodology in identifying 
predictively reliable theories” provides the basis for the NMA. 

It is important to note that the frequency-based NMA is not adequately 
captured by Howson’s reconstruction. An accurate Bayesian reconstruc- 
tion of the frequency-based NMA must include updating under the obser- 
vation of scientific successes and failures in the entire research field. This 
shall be done now. To start, we specify a scientific discipline or research
field. We count all $n_E$ theories in the field that have been empirically
tested and determine the number $n_S$ of theories that were predictively
successful. We can thus state the following observation O:

O: $n_S$ out of $n_E$ theories in the research field are predictively successful.


Let us assume that we are confronted with a new and so far empirically 
untested theory T in that research field. We want to extract the probability 
p(S|O) for the predictive success S of T given observation O. In order not 
to beg the question by assuming predictive success a priori, we assume a 
prior probability $p(S) = \varepsilon$ where $\varepsilon$ can be an arbitrarily small number.

We then assume that each new theory that appears in the research field 
can be treated as a random pick with respect to predictive success. That 
is, we assume that there is a certain overall rate of predictively successful 
theories in the research field and, in the absence of further knowledge, the 
success chances of a new theory should be estimated according to our best 
estimate r of that success rate: 


\[
r = p(S|O) \tag{5.6}
\]

$r$ will be based on observation O. The most straightforward assessment of
$r$ is to use the long-run information about the frequency of success in a
discipline and to identify $r$ with

\[
\mathrm{freq} = \frac{n_S}{n_E} \tag{5.7}
\]

Moreover, we make two assumptions similar to the individual-theory-based
NMA:

$A_1^O$: $p(S|H,O)$ is quite large.
$A_2^O$: $p(S|\neg H,O) < k \ll 1$.

Note that realists assume that the empirical adequacy of T is the
dominating element in explaining the theory’s predictive success. If that is
so, then S is roughly conditionally independent of O given H (= theory T
is empirically adequate) and we have $p(S|H,O) \approx p(S|H)$ and
$p(S|\neg H,O) \approx p(S|\neg H)$. The conditions $A_1^O$ and $A_2^O$
then roughly correspond to $A_1$ and $A_2$.

We now come to the crucial point of our analysis: accounting for
observation O blocks the base-rate fallacy. The base-rate fallacy in the
individual-theory-based NMA consisted in disregarding the possibility of
arbitrarily small priors $p(H)$. In the frequency-based NMA, however, the
crucial probability is $p(H|O)$ rather than $p(H)$. Updating the probability
of S on observation O has an impact on $p(H|O)$.


Proposition 5.3 If conditions $A_1^O$ and $A_2^O$ are satisfied, then the
following inequality holds:

\[
p(H|O) > r - k. \tag{5.8}
\]


The frequency-based NMA takes it as a premise, as an observed fact 
about (parts of) mature science, that $n_S/n_E$ is fairly large. Hence, $r$ is fairly
large and we can infer from Equation (5.8) that p(H|O) is also substantially 
greater than zero. Thus, the base-rate fallacy is avoided. Note that the 
first and crucial inference in the frequency-based NMA is made before 
accounting for the predictive success of T itself: it relates p(S|O) to p(H|O) 
by the law of total probability. 

However, the final strength of the NMA is expressed by the value
$p(H|S,O)$. In other words, the realist has to show that

\[
p(H|S,O) > K, \tag{5.9}
\]


where K is, as before, some reasonably high probability value. K = 1/2 
may be viewed as a plausible condition for taking the NMA seriously. 
How does a condition on $p(H|S,O)$ translate into a condition on $p(S|O)$
and therefore on the observed success frequency $r$? First we observe the
following result: 


Proposition 5.4

\[
p(H|S,O) = \frac{p(S|H,O)}{p(S|O)} \cdot \frac{p(S|O) - p(S|\neg H,O)}{p(S|H,O) - p(S|\neg H,O)}
\]

Then, we observe that $p(H|S,O)$ is decreasing in $p(S|H,O)$ if $p(S|O)$ and
$p(S|\neg H,O)$ are held fixed. It thus makes sense to focus on the case
where it is most difficult for the realist to make the frequency-based NMA
work, namely the case $p(S|H,O) = 1$. Then, the following theorem describes
a sufficient condition for $p(H|S,O)$ to exceed the threshold K:


Theorem 5.1 If conditions $A_1^O$ and $A_2^O$ are satisfied, if
$p(S|H,O) = 1$, and the inequality below holds:

\[
p(S|O) > \frac{p(S|\neg H,O)}{1 - K + K \cdot p(S|\neg H,O)} \tag{5.10}
\]

then inequality (5.9) is satisfied as well: $p(H|S,O) > K$.


For K = 1/2, and using the base rate estimate $p(S|O) = n_S/n_E$, this is
the case if and only if

\[
\frac{n_S}{n_E} > \frac{2\, p(S|\neg H,O)}{1 + p(S|\neg H,O)}. \tag{5.11}
\]


In particular, $2\, p(S|\neg H,O) < n_S/n_E$ is sufficient for satisfying
Equations (5.11) and (5.9). Thus, we don’t need an impressively high rate of
predictive success for a significant argument in favor of scientific realism.
A defender of the NMA can avoid the base-rate fallacy by taking a global
perspective on the success of science. We have also argued that this
perspective is more faithful to the intentions of those who put the NMA
forward in the first place, namely Hilary Putnam and Richard Boyd.
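
To see how these conditions play out with concrete numbers, consider the following Python sketch. It is an illustration under our own assumptions: the function names are hypothetical, and the track record (30 successful out of 100 tested theories) and the value $k = 0.1$ for $p(S|\neg H,O)$ are invented for the example rather than taken from any case study.

```python
def posterior_given_success(r, p_s_h, p_s_not_h):
    """p(H | S, O) as in Proposition 5.4, where r = p(S | O).

    Coherence requires p_s_not_h <= r <= p_s_h; this is not checked here.
    """
    # Law of total probability, solved for p(H | O).
    p_h_given_o = (r - p_s_not_h) / (p_s_h - p_s_not_h)
    # Bayes' theorem: update p(H | O) on the success S of T itself.
    return p_h_given_o * p_s_h / r


def success_rate_threshold(big_k, p_s_not_h):
    """Right-hand side of (5.10): the success rate p(S | O) must exceed this
    value for p(H | S, O) > K when p(S | H, O) = 1."""
    return p_s_not_h / (1.0 - big_k + big_k * p_s_not_h)


if __name__ == "__main__":
    n_s, n_e = 30, 100   # hypothetical track record of a research field
    r = n_s / n_e        # base-rate estimate of p(S | O)
    k = 0.1              # hypothetical value of p(S | not-H, O)
    print("lower bound on p(H|O):", r - k)
    print("p(H|S,O):", posterior_given_success(r, 1.0, k))
    print("threshold for K = 1/2:", success_rate_threshold(0.5, k))
```

With these invented numbers a success rate of 30% already lies above the threshold for K = 1/2, which illustrates the point that the frequency-based NMA does not need an impressively high rate of predictive success.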

All this does not imply that the NMA is valid. A supporter of the 
frequency-based NMA must specify on which grounds she takes a high 
frequency of predictive success in science to be borne out by the data. 
And she must undertake the difficult task of justifying assumptions $A_1^O$
and (especially) $A_2^O$. But we have shown, contra Howson, that the NMA
still has a fighting chance. For philosophers like ourselves, who are not
committed to a particular position in the debate between realists and
anti-realists, this probabilistic reconstruction of the NMA offers the chance to
understand the argumentative mechanics behind the realist intuition, to 
better appreciate the context-dependency of the NMA, and to critically 
evaluate the merits of realist and anti-realist standpoints. 


Discussion 


This variation has investigated the scope and limits of the No Miracles
Argument (NMA) when formalized as a probabilistic argument aiming at the
empirical adequacy of a particular theory T. In the simple probabilistic 
model of the NMA, we have confirmed the diagnosis that it does not hold 
water as an objective argument (Howson, 2000, 2013): too much depends 
on the choice of the prior probability p(H), assuming what is supposed to 
be shown. We have supported this diagnosis by a detailed analysis of the 
probabilistic mechanics of the NMA.

Then, we have shown two possible ways out of the dilemma. First, 
we have investigated how the stability of a scientific theory over time may 
impact the probability that it is empirically adequate. We have shown that 
such observations can greatly increase the range of prior probabilities for 
which the NMA leads to acceptance of T. Second, we have demonstrated 
that Howson’s objection can be mitigated if the base rate of predictively 
successful theories in a specific discipline is taken into account. This is 
also faithful to the line of argument of those scientific realists who think of 
the NMA as a global argument based on the high frequency of successful 
theories in science. In both cases, we have supplemented the classical 
NMA reasoning with novel and distinct kinds of evidence that can be 
embedded into a Bayesian framework. Using our models, the realist thesis 
(or at least the part leading up to empirical adequacy) can be defended 
with much weaker assumptions than in the simple version of the NMA. 

Finally, we should mention the No Alternatives Argument (NAA): the 
claim that the continuous failure to find satisfactory alternatives to a the- 
ory provides evidence for it. In the previous variation, we have shown that 
under plausible assumptions, this observation indeed raises the probabil- 
ity that theory T is empirically adequate. The NAA can also be seen as a 
variation of the NMA: the empirical adequacy of T is the only explanation 
for why scientists have not yet found an alternative. Yet, we have also seen
that while this reasoning is valid in principle, it is usually not sufficient to
push the probability of H (the hypothesis that T is empirically adequate) 
beyond a critical threshold—at least not without additional assumptions. 


This analogy brings us to several projects for future research. First, it would be
exciting to investigate the parallel between the NAA and the NMA in 
greater detail, and to proceed to a general analysis of argument patterns 
that take non-empirical evidence (in our technical sense of the word) as 
their premise. Second, we have seen that the probabilistic versions of the 
NMA are highly sensitive to $p(S|\neg H)$ and related quantities that express
the probability of empirical success if T is not empirically adequate. The 
evaluation of this quantity is itself a point of contention between realists 
and anti-realists: after all, anti-realists often stress the explanatory success 
of false models whereas realists are usually committed to the thesis that 
only true models yield stable empirical success. An investigation of this 
question, supported by case studies, would be highly useful. Third, we 
need to examine whether theories have really been more stable during the 
20th century than before since this is a crucial premise in the probabilistic 
individual-theory-based NMA. For tackling this research question, a com- 
bination of case studies and scientometric analysis (e.g., along the lines of 
Herfeld and Doehne, 2016) strikes us as a promising approach. Fourth, it 
would be good to explore whether the probabilistic NMA can be extended 
into an argument for the full realist position, that is, the view that T is true 
(and not only empirically adequate). At present, we do not see an obvious 
way of doing so (it seems that the argument would just run into the
underdetermination problem), but we invite realists to take our formalism
and to apply it to a full-fledged defense of the realist view.


It is noteworthy that all formalizations of the NMA stressed the scien- 
tific track record in the particular discipline to which T belongs. Instead of 
reading the NMA as a “wholesale argument” for scientific realism that is valid
across the board, we should understand it as a “retail argument” (Magnus 
and Callender, 2004), that is, as an argument that may be strong for some 
scientific theories and weak for others. While context-sensitive assump- 
tions are required in our arguments, their relative weakness leaves open 
the possibility of a coherent, non-circular realist position in philosophy 
of science. It also makes the realist argument more sensitive to scientific 
practice which is, ultimately, something that all formal reconstructions of 
scientific reasoning should aim at. Together with the preceding variation,
this variation has shown how Bayesian reasoning can model and vindicate
argument patterns that support the realist hypothesis, and how Bayesian
models can contribute to a fair assessment of the debate between realists
and anti-realists.
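Proofs of the Theorems

Proof of Proposition 5.1:

\begin{align*}
p(\neg C S H) &= \sum_{A} p(A)\, p(\neg C|A S)\, p(S|H)\, p(H|A)
 = \sum_{j=0}^{\infty} a_j\, c_j\, s\, h_j \\
p(\neg C S) &= \sum_{A,H} p(A)\, p(\neg C|A S)\, p(S|H)\, p(H|A) \\
 &= \sum_{A} p(A)\, p(\neg C|A S)\, p(S|H)\, p(H|A)
  + \sum_{A} p(A)\, p(\neg C|A S)\, p(S|\neg H)\, p(\neg H|A) \\
 &= \sum_{j=0}^{\infty} a_j\, c_j\, \bigl(s\, h_j + s'(1 - h_j)\bigr)
\end{align*}

With the help of Bayes’ Theorem, these equations allow us to calculate
the posterior probability of H conditional on $\neg C$ and S:

\[
p(H|\neg C S) = \frac{p(\neg C S H)}{p(\neg C S)}
 = \frac{\sum_{j=0}^{\infty} a_j\, c_j\, s\, h_j}{\sum_{j=0}^{\infty} a_j\, c_j\, \bigl(s\, h_j + s'(1 - h_j)\bigr)}
\]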


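Proof of Proposition 5.2: Assumption B4 is equivalent to the following
claim:

\[
p\Bigl(A_j \,\Big|\, \bigvee_{k=j}^{\infty} A_k\Bigr) = p\Bigl(A_{j+1} \,\Big|\, \bigvee_{k=j+1}^{\infty} A_k\Bigr) \qquad \forall j \geq 0,
\]

which entails that for all $j \geq 0$, we have

\[
\frac{p(A_j)}{p\bigl(\bigvee_{k=j}^{\infty} A_k\bigr)} = \frac{p(A_{j+1})}{p\bigl(\bigvee_{k=j+1}^{\infty} A_k\bigr)}.
\]

This implies in turn

\begin{align*}
p(A_{j+1}) &= p(A_j)\, \frac{p\bigl(\bigvee_{k=j+1}^{\infty} A_k\bigr)}{p\bigl(\bigvee_{k=j}^{\infty} A_k\bigr)} \\
 &= p(A_j)\, \frac{1 - \sum_{k=0}^{j} p(A_k)}{1 - \sum_{k=0}^{j-1} p(A_k)}.
\end{align*}

By a simple induction proof, we can now show

\[
p(A_n) = p(A_0) \Bigl(1 - \sum_{k=0}^{n-1} p(A_k)\Bigr) \tag{5.12}
\]

For n = 1, equation (5.12) follows immediately. Assuming that it holds for
level n, we then obtain

\begin{align*}
p(A_{n+1}) &= p(A_n)\, \frac{1 - \sum_{k=0}^{n} p(A_k)}{1 - \sum_{k=0}^{n-1} p(A_k)} \\
 &= p(A_0) \Bigl(1 - \sum_{k=0}^{n-1} p(A_k)\Bigr) \frac{1 - \sum_{k=0}^{n} p(A_k)}{1 - \sum_{k=0}^{n-1} p(A_k)} \\
 &= p(A_0) \Bigl(1 - \sum_{k=0}^{n} p(A_k)\Bigr)
\end{align*}

where we have used the inductive premise in the second step. Finally, we
use straight induction once more to show that

\[
p(A_n) = p(A_0)\, (1 - p(A_0))^n \tag{5.13}
\]

where the case n = 0 is trivial and the inductive step n → n + 1 is proven
as follows:

\begin{align*}
p(A_{n+1}) &= p(A_0) \Bigl(1 - \sum_{k=0}^{n} p(A_k)\Bigr) \\
 &= p(A_0) \Bigl(1 - \sum_{k=0}^{n} p(A_0)\, (1 - p(A_0))^k\Bigr) \\
 &= p(A_0) \Bigl(1 - p(A_0)\, \frac{1 - (1 - p(A_0))^{n+1}}{1 - (1 - p(A_0))}\Bigr) \\
 &= p(A_0) \bigl(1 - (1 - (1 - p(A_0))^{n+1})\bigr) \\
 &= p(A_0)\, (1 - p(A_0))^{n+1}
\end{align*}

In the second line, we have applied the inductive premise to $p(A_k)$, and
in the third line, we have used the well-known formula for the geometric
series:

\[
\sum_{k=0}^{n} q^k = \frac{1 - q^{n+1}}{1 - q}
\]

This shows (5.13) and completes the proof of the proposition.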
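
As a quick plausibility check of this result, the following Python sketch verifies numerically that the geometric distribution of (5.13) satisfies the condition in Equation (5.4): the probability of there being more than $j$ alternatives, given that there are at least $j$, comes out the same for every $j$. The sketch is our own; the value $a_0 = 0.2$, the truncation length and the helper name `continuation_probability` are arbitrary illustrative choices.

```python
def continuation_probability(a, j):
    """p(A > j | A >= j) for a (truncated) list of probabilities a[k] = p(A_k)."""
    at_least_j = sum(a[j:])
    more_than_j = sum(a[j + 1:])
    return more_than_j / at_least_j


a0 = 0.2          # hypothetical prior probability of having no alternatives
n_terms = 400     # truncation; the neglected tail mass is negligible here
a = [a0 * (1.0 - a0) ** n for n in range(n_terms)]   # Equation (5.13)

if __name__ == "__main__":
    for j in range(5):
        # Each value should agree with 1 - a0 up to rounding error.
        print(j, continuation_probability(a, j))
```

The common value is $1 - a_0$, the probability of finding a first alternative to T.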


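Calculation of the degree of confirmation (Figure 5.6): We calculate

\begin{align*}
p(\neg C S|H) &= \frac{p(\neg C S H)}{p(H)}
 = \frac{\sum_{A} p(A)\, p(\neg C|A S)\, p(S|H)\, p(H|A)}{\sum_{A} p(A)\, p(H|A)} \\
 &= \frac{\sum_{j=0}^{\infty} a_j\, c_j\, s\, h_j}{\sum_{j=0}^{\infty} a_j\, h_j}
 = \frac{\sum_{j=0}^{\infty} (1 - a_0)^j\, \frac{1}{(j+1)^2}}{\sum_{j=0}^{\infty} (1 - a_0)^j\, \frac{1}{j+1}}
\end{align*}

and obtain

\begin{align*}
p(\neg C S|\neg H) &= \frac{p(\neg C S \neg H)}{p(\neg H)}
 = \frac{\sum_{A} p(A)\, p(\neg C|A S)\, p(S|\neg H)\, p(\neg H|A)}{\sum_{A} p(A)\, p(\neg H|A)} \\
 &= \frac{\sum_{j=0}^{\infty} a_j\, c_j\, s'\, (1 - h_j)}{\sum_{j=0}^{\infty} a_j\, (1 - h_j)}
 = s'\, \frac{\sum_{j=0}^{\infty} (1 - a_0)^j\, \frac{j}{(j+1)^2}}{\sum_{j=0}^{\infty} (1 - a_0)^j\, \frac{j}{j+1}}
\end{align*}

where the last step in each case uses B1–B3 and Proposition 5.2.

Proof of Proposition 5.3: We first apply the law of total probability

\begin{align*}
p(S|O) &= p(S|H,O)\, p(H|O) + p(S|\neg H,O)\, p(\neg H|O) \\
 &= p(S|H,O)\, p(H|O) + p(S|\neg H,O) \cdot (1 - p(H|O)) \\
 &= p(H|O) \cdot \bigl(p(S|H,O) - p(S|\neg H,O)\bigr) + p(S|\neg H,O)
\end{align*}

Hence,

\[
p(H|O) = \frac{p(S|O) - p(S|\neg H,O)}{p(S|H,O) - p(S|\neg H,O)} \tag{5.14}
\]

From Equation (5.14), we derive

\begin{align*}
p(H|O) &= \frac{p(S|O) - p(S|\neg H,O)}{p(S|H,O) - p(S|\neg H,O)} \\
 &\geq \frac{p(S|O) - p(S|\neg H,O)}{1 - p(S|\neg H,O)} \tag{5.15} \\
 &\geq p(S|O) - p(S|\neg H,O) \tag{5.16} \\
 &> p(S|O) - k \tag{5.17}
\end{align*}

Here Equation (5.16) follows from assumptions $A_1^O$ and $A_2^O$ and
Equation (5.17) follows from assumption $A_2^O$. This leads directly to the
desired result:

\[
p(H|O) > p(S|O) - k = r - k
\]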


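Proof of Proposition 5.4: From Bayes’ Theorem and Equation (5.14),
we infer

\begin{align*}
p(H|S,O) &= \frac{p(H|O)\, p(S|H,O)}{p(S|O)} \\
 &= \frac{p(S|H,O)}{p(S|O)} \cdot \frac{p(S|O) - p(S|\neg H,O)}{p(S|H,O) - p(S|\neg H,O)} \tag{5.18}
\end{align*}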


Proof of Theorem 5.1: We have assumed that $p(S|H,O) = 1$. Inserting
this equality into (5.18) gives us

\[
p(H|S,O) = \frac{1}{p(S|O)} \cdot \frac{p(S|O) - p(S|\neg H,O)}{1 - p(S|\neg H,O)}
\]

Hence, we can write the condition $p(H|S,O) > K$ as

\[
\frac{1}{p(S|O)} \cdot \frac{p(S|O) - p(S|\neg H,O)}{1 - p(S|\neg H,O)} > K
\]

Rewriting this inequality a couple of times, we obtain

\begin{align*}
\frac{1}{p(S|O)} \cdot \frac{p(S|O) - p(S|\neg H,O)}{1 - p(S|\neg H,O)} &> K \\
\frac{1}{p(S|O)} \cdot \bigl(p(S|O) - p(S|\neg H,O)\bigr) &> K\, (1 - p(S|\neg H,O)) \\
1 - \frac{p(S|\neg H,O)}{p(S|O)} &> K\, (1 - p(S|\neg H,O)) \\
1 - K\, (1 - p(S|\neg H,O)) &> \frac{p(S|\neg H,O)}{p(S|O)} \\
p(S|O) \cdot \bigl(1 - K\, (1 - p(S|\neg H,O))\bigr) &> p(S|\neg H,O) \\
p(S|O) &> \frac{p(S|\neg H,O)}{1 - K + K \cdot p(S|\neg H,O)}
\end{align*}

This was exactly one of the assumptions of the theorem. Thus we can infer
that $p(H|S,O) > K$.
Variation 6: Causal Effect


From Aristotle to the 21st century, causation has usually been treated as a
qualitative, all-or-nothing concept. Either C is a cause of E or it is not. However,
sometimes we have to make more nuanced causal judgments that involve 
a quantitative dimension: C is a more effective cause of E than C’, the 
causal effect of C on E is twice as high as the effect of C’, etc. This is espe- 
cially important for purposes of prediction and evaluating experimental 
findings (e.g., Rubin, 1974; Rosenbaum and Rubin, 1983; Pearl, 2001). For 
instance, a regulatory medical body like the US Food and Drug
Administration (FDA) or the European Medicines Agency (EMA) only admits a new
drug to the market if there is a substantial causal effect on recovery rates. 
The effect of immigration on the crime rate shapes political views, creates 
prejudices and affects concrete policy decisions. The effect of the driver’s 
speeding on a traffic accident influences the amount of compensation that 
the victim may receive. All these judgments tap into the concept of causal
effect, or equivalently, causal strength or graded causation. 

While a huge amount of literature has been devoted to the qualitative 
question “When is C a cause of E?” (e.g., Hume, 1739; Suppes, 1970; Lewis, 
1973; Mackie, 1974; Woodward, 2003), and the comparative question “Is C 
or C’ a more effective cause of E?” is starting to be explored as well (e.g.,
Chockler and Halpern, 2004; Halpern and Hitchcock, 2016), the quantita- 
tive question “What is the causal effect of C on E?” is relatively neglected, 
given the huge scope of actual and potential applications in science. There 
are proposals from different disciplines, such as psychology (Cheng, 1997), 
computer science (Pearl, 2000), statistics (Good, 1961a,b) and philosophy 
(Eells, 1991), but apart from a survey paper by Fitelson and Hitchcock 
(2011), no attempt has been made at a unified theory of causal effect.

Measures of causal effect can differ substantially. Consider a clinical
trial where the effect of a new drug for treating migraine is compared
to a control group that receives the standard treatment. The effects are
measured on a binary scale: did the pain diminish significantly or not?
Suppose the results are described by Table 6.1.

Group/Outcome | Effect | No Effect | Total Number
Treatment     | A=30   | B=90      | A+B=120
Control       | C=15   | D=105     | C+D=120

Table 6.1: The result of a clinical trial where the efficacy of a new migraine
treatment is compared to a control group. How should the causal effect of
the treatment be quantified?

In the epidemiological literature, several measures have been proposed to
measure the size of such an effect (Davies et al., 1998; Deeks, 1998;
Sistrom and Garvan, 2004; King et al., 2012):


Relative Risk (RR) The ratio of the observed relative frequencies of an
effect in both groups.

\[
RR = \frac{A/(A+B)}{C/(C+D)}
\]

Odds Ratio (OR) The ratio of the odds for an effect in both groups.

\[
OR = \frac{A/B}{C/D}
\]

Absolute Risk Reduction (ARR) The difference between the relative
frequencies of an effect in both groups.

\[
ARR = \frac{A}{A+B} - \frac{C}{C+D}
\]


To give an example with the numbers from Table 6.1: The relative risk
would be RR = 2, meaning that significant pain relief is twice as frequent
in the treatment group as in the control group. The result looks similar
for the odds ratio OR = 2.33, but the absolute risk reduction ARR = 0.125
tells a less enthusiastic story: only for 12.5% of the affected population,
the new treatment makes a difference. This prompts the question of which
measure should be preferred, and for which reasons (e.g., Stegenga, 2015;
Sprenger and Stegenga, 2016). Related questions pop up in the psychological
literature on causal induction (e.g., Cheng, 1997; Sloman and Lagnado,
2015): how can we quantify the power of a cause to produce an effect?
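
The three measures are easy to compute from a 2×2 table. The following Python sketch, with a hypothetical helper `effect_measures`, uses the standard textbook definitions of relative risk, odds ratio and absolute risk reduction and applies them to the cell counts A, B, C, D of Table 6.1; it is meant only as an illustration of how the measures can come apart.

```python
def effect_measures(a, b, c, d):
    """Relative risk, odds ratio and absolute risk reduction for a 2x2 table.

    a, b: treatment group with / without the effect
    c, d: control group with / without the effect
    """
    freq_treatment = a / (a + b)   # relative frequency of the effect, treatment
    freq_control = c / (c + d)     # relative frequency of the effect, control
    return {
        "RR": freq_treatment / freq_control,
        "OR": (a / b) / (c / d),
        "ARR": freq_treatment - freq_control,
    }


if __name__ == "__main__":
    # Cell counts from Table 6.1.
    measures = effect_measures(30, 90, 15, 105)
    for name, value in measures.items():
        print(name, round(value, 3))
```

The same table thus supports a ratio of 2 on one measure and a difference of one eighth on another, which is the divergence the text goes on to discuss.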

The challenge for a philosophical theory of causal effect is to
characterize these measures and to weigh the reasons for preferring one of
them over another. In this variation, we develop axiomatic foundations for
measures of causal effect between binary variables (e.g., propositions).
First, we defend the choice of a framework where different measures can be
embedded and compared: causal Bayes nets. Second, we derive
representation theorems for various measures of causal effect, that is,
theorems that characterize a measure of causal effect in terms of a set of
adequacy conditions. Third, we compare and discuss these measures with a view towards
applications: Under which conditions are they invariant? What are benefi- 
cial and what are problematic properties? To the extent that the proposed 
adequacy conditions are found compelling, the technical results have nor- 
mative implications for the choice of a measure of causal effect. Indeed, 
we will make a case for a particular measure as opposed to working with 
a plurality of causal effect measures (see also Sprenger, 2016c). 

Our approach is methodologically innovative in transferring methods 
from probabilistic accounts of confirmation and explanatory power to a 
probabilistic theory of causal effect (e.g., Schupbach and Sprenger, 2011; 
Crupi and Tentori, 2012, 2013; Crupi et al., 2013). Thereby, we also cre- 
ate a bridge between different areas of philosophy of science and formal 
epistemology, and between different parts of this book (→ Variations 2 and
7).

The remainder is structured as follows: Section 6.1 motivates the choice 
of causal Bayes nets as a framework for explicating causal effect. Then we 
provide a set of general adequacy conditions in Section 6.2 which are com- 
plemented by more specific conditions in Sections 6.3-6.6. These sections 
also contain the representation theorems. Section 6.7 presents a brief ap- 
plication in medical science while Section 6.8 discusses future research 
questions and concludes. Section 6.9 contains the proofs of the theorems. 


The Framework: Causal Bayes Nets 


When we reason about causes, we often think that they make a differ- 
ence to their effects. Causes which do not matter for the occurrence of 
an effect are no real causes. This thought has already been articulated by 
David Hume (1711-1776) in his famous description of two causally related 
objects: “if the first object had not been, the second never had existed” 
(Hume 1748/77). This line of reasoning is developed in the counterfactual 
and probabilistic accounts of causation (Lewis, 1973, 1979; Reichenbach,
1956; Suppes, 1970; Cartwright, 1979). It is also exemplified in many cases
of scientific inference, such as Randomized Controlled Trials (RCT). There,
we would like to assess the efficacy of a drug and we divide the trial
participants into two groups: one that receives the new drug, and one that
receives the standard treatment, or a placebo. The causal efficacy of the 
new drug is then a function of the divergence between the results in the 
treatment and the control group, like in the example of Table 6.1. 

With an eye on scientific applications, it is also clear that the envisioned 
account of causal effect should be graded rather than an “all-or-nothing” 
account. Probability is a natural tool which can step in here. After all, 
medical drugs typically raise the probability of recovery; almost none of 
them makes recovery certain. The same can be said about psychologi- 
cal experiments or economic policy decisions: interventions increase the 
frequency of a particular response, but they do not guarantee it. 

Probabilistic measures of causal effect thus nicely square with an ac- 
count of causal relevance where causes raise the probability of the effects. 


On the probabilistic account of causal relevance, C is a cause of E 
if and only if a change in the value of C (e.g., C instead of ¬C) changes
the probability that E occurs. This theory captures the basic intuition that 
causes must make a difference to their effects without necessitating them. 
For example, not every regular smoker will eventually suffer from lung 
cancer, but still, we would like to classify smoking as a cause of lung 
cancer. The account of causation as probability-raising gets this example 
right, and many other examples of scientific reasoning as well. 


However, the probabilistic account struggles to distinguish genuinely 
causal relevance from mere statistical correlation, e.g., in a case where 
both variables are correlated as a result of a common cause X. For exam- 
ple, a high crime rate in certain neighborhoods of Dutch cities was found 
to be correlated with a high percentage of migrants living there. How- 
ever, the correlation did not indicate real causation. It could be explained 
away by a common cause: the low socio-economic status of these neigh- 
borhoods (Jensma, 2014). Conditional on the various levels of average 
income, there was no correlation between the crime rate and the number 
of migrants in a neighborhood. The naive probabilistic account gets this 
wrong and still judges the number of migrants to be a cause of the high 
crime rate. To solve this problem, it has been suggested that the putative 
cause has to raise the probability of the effect in all background contexts 
(e.g., Cartwright, 1979). However, such a condition is very strict: causal
relations vanish as soon as there is a single background context where the 
probability is lowered. For purposes of control and intervention, such a 
requirement is often impractical. It has also been criticized as failing to 
match our intuitions in causal reasoning (Dupré, 1984; Eells, 1991). 


The interventionist account of causation (Spirtes et al., 2000; Pearl, 
2000) provides an alternative to a purely probabilistic model of causation. 
It is relative to the choice of a causal model M: a directed acyclic graph
(DAG) G, consisting of a set of vertices (=variables) and directed edges, 
and a probability distribution over the variables in G. The edges represent 
how the effect of an intervention transfers to other variables; conversely, 
lack of a direct connection between variables codifies a conditional inde- 
pendence. On the interventionist account, C is a cause of E if and only if an 
intervention on C causes a change of value in E, or changes the probability 
that E takes a certain value (Woodward, 2012). 


But what is an intervention? An ideal intervention forces a variable C 
to take a certain value while breaking the influence that other variables 
may have on it. Pearl’s notation for such an intervention is do(C = x). 
Formally, this means “lifting C from the influence of the old functional 
mechanism and placing it under the influence of a new mechanism that 
sets the value C = x while keeping all other mechanisms undisturbed” 
(Pearl, 2000, 70, notation changed). See also Spirtes et al. (2000). Imag- 
ine that we would like to study the effects of classroom light on whether 
students are awake or asleep. The intensity of classroom light depends 
on the settings of the audiovisual system. However, we may press the 
light switch manually, overruling the system settings, and then study the 
effects of our intervention on the students (e.g., they wake up from deep 
sleep). This way, we directly intervene on the light intensity and break the 
functional dependency on the preconfigured system settings. 


The interventionist account naturally distinguishes genuinely causal 
relations between C and E from relations where both variables are corre- 
lated as a result of a common cause X. See Figure 6.1. When one inter- 
venes on C, the causal arrow leading from X to C is broken and no effect on 
E occurs. On the probabilistic account, it is less straightforward to express 
this difference since C and E are positively correlated (e.g., Eells, 1991). 
While the probabilistic account describes causation in terms of statistical 
relevance, comparing p(E|C) and p(E|¬C), the interventionist account
focuses on the probability of the effect conditional on an intervention on the
cause, that is, p(E|do(C)).

Figure 6.1: A typical common cause (conjunctive fork) structure. An
intervention on C would disrupt the causal arrow leading into this variable
from X and not have any effect on E.
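
The contrast between conditioning and intervening can be illustrated with a minimal sketch of a common cause structure X → C, X → E. All probabilities below are made-up values chosen for illustration, and the function names are hypothetical; the only structural assumption is that E depends on X alone, so that an intervention on C cuts the arrow from X to C and leaves X at its prior.

```python
# Hypothetical common-cause model X -> C, X -> E (all numbers are illustrative).
P_X = 0.3                                   # p(X)
P_C_GIVEN_X = {True: 0.9, False: 0.2}       # p(C | X), p(C | not-X)
P_E_GIVEN_X = {True: 0.8, False: 0.1}       # p(E | X), p(E | not-X); E ignores C


def p_x(x):
    """Prior probability that X takes the truth value x."""
    return P_X if x else 1 - P_X


def p_e_given_c(c):
    """Observational p(E | C=c): conditioning on C shifts the weights on X."""
    weights = {
        x: p_x(x) * (P_C_GIVEN_X[x] if c else 1 - P_C_GIVEN_X[x])
        for x in (True, False)
    }
    total = sum(weights.values())
    return sum(weights[x] * P_E_GIVEN_X[x] for x in weights) / total


def p_e_do_c(c):
    """Interventional p(E | do(C=c)): the arrow X -> C is cut, X keeps its prior."""
    return sum(p_x(x) * P_E_GIVEN_X[x] for x in (True, False))


observational_gap = p_e_given_c(True) - p_e_given_c(False)   # positive: correlation
interventional_gap = p_e_do_c(True) - p_e_do_c(False)        # zero: no causal effect
```

In this toy model C and E are positively correlated, yet intervening on C makes no difference to E, which is the pattern the conjunctive fork is meant to capture.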

In this variation, both perspectives are combined. We express causal 
effect as a function of p(E|do(C)) and p(E|do(¬C)). This means that our
account of causal strength supervenes on causal models represented by 
causal Bayes nets, already familiar from the introduction and the previous 
variations. In fact, we believe that causal Bayes nets are an intuitive and 
helpful tool in causal reasoning: first, it is easy to spot which interven- 
tions affect which variables; second, they resemble other tools for causal 
inference, such as neural networks or connectionist expert systems. Causal 
inference with Bayes nets, including measuring causal effect, can therefore 
be easily transferred to causal inference in the mind and brain sciences. 

Table 6.2 translates various probabilistic measures of causal effect to
the causal Bayes nets framework described above. The next sections will
characterize and compare these measures.

Pearl (2000)              η(C,E) = p(E|do(C))
Suppes (1970)             η(C,E) = p(E|do(C)) − p(E)
Eells (1991)              η(C,E) = p(E|do(C)) − p(E|do(¬C))
“Galton” (covariation)    η(C,E) = 4 p(do(C)) p(do(¬C)) [p(E|do(C)) − p(E|do(¬C))]
Lewis (1986)              η(C,E) = p(E|do(C)) / p(E|do(¬C))
Cheng (1997)              η(C,E) = [p(E|do(C)) − p(E|do(¬C))] / [1 − p(E|do(¬C))]
Good (1961a,b)            η(C,E) = log [(1 − p(E|do(¬C))) / (1 − p(E|do(C)))]

Table 6.2: Some prominent measures of causal effect. We follow the labels
of Fitelson and Hitchcock (2011).
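
As a rough sketch, several of these measures can be written as plain functions of the two interventional probabilities. The function names are hypothetical, `p1` stands for p(E|do(C)) and `p0` for p(E|do(¬C)). Two assumptions are made here: the "Galton" measure treats p(do(C)) as the base rate of the cause, and Cheng's measure is given in its standard causal power form.

```python
def pearl(p1, p0):
    """Pearl: the probability of E under do(C); ignores the contrast class."""
    return p1


def eells(p1, p0):
    """Eells: difference between the two interventional probabilities."""
    return p1 - p0


def galton(p1, p0, base_rate):
    """'Galton' (covariation): difference weighted by 4 * p(C) * p(not-C).

    Assumption: base_rate plays the role of p(do(C)) in the table.
    """
    return 4 * base_rate * (1 - base_rate) * (p1 - p0)


def lewis(p1, p0):
    """Lewis: probability ratio (undefined when p0 == 0)."""
    return p1 / p0


def cheng(p1, p0):
    """Cheng: causal power in its standard form (undefined when p0 == 1)."""
    return (p1 - p0) / (1 - p0)


# Illustrative values only.
example = {
    "pearl": pearl(0.6, 0.2),
    "eells": eells(0.6, 0.2),
    "galton": galton(0.6, 0.2, base_rate=0.5),
    "lewis": lewis(0.6, 0.2),
    "cheng": cheng(0.6, 0.2),
}
```

With p(E|do(C)) = 0.6 and p(E|do(¬C)) = 0.2, the measures disagree noticeably in scale, which is one reason why the representation theorems below characterize them only up to ordinal equivalence.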


General Adequacy Conditions 


We aim at capturing the size of a causal effect between categorical, binary 
variables. To this end, let L be a propositional language with variables
C, E ∈ L. In agreement with the framework presented in the previous
section, we demand that the causal effect between C and E depend on the
causal model M, that is, a directed acyclic graph in which C and E are
included, with a probability distribution over the variables. This leaves out
external factors such as typicality, normative expectations and defaults, 
which are of theoretical significance and have been shown to affect causal 
judgments in experimental settings (Knobe and Fraser, 2008; Hitchcock 
and Knobe, 2009; Halpern and Hitchcock, 2016). While this implies that 
our model does not capture all aspects of judgments of causal effect, there 
are many applications (e.g., quantifying effect size in science) where it 
is desirable to eliminate normative considerations, and to derive causal 
effect from observed relative frequencies. Moreover, our approach quan- 
tifies causal effect with respect to a single background context, sidestep- 
ping a substantial discussion in the field of probabilistic causation (e.g., 
Cartwright, 1979; Dupré, 1984; Eells, 1991). 


Formality For two binary variables C, E ∈ L and a causal model M ∈ ℳ,
the causal effect of C on E, η(C,E), is a continuous real-valued
function operating on a subset of L² × ℳ, namely the set

S := {(C, E, M) ∈ L² × ℳ | M contains C and E as variables}    (6.1)

In particular, the causal effect measure η(C,E) can be represented by
a continuous function f : [0,1]³ → ℝ such that

η(C,E) = f(p(C), p(E|do(C)), p(E|do(¬C)))

This means that η(C,E) can be expressed as a function of the base
rate of the cause and the probability of E under the relevant interventions
on the cause: p(E|do(C)) and p(E|do(¬C)). This takes up the basic idea
behind probabilistic relevance accounts of causation.

Formality is blind to mediator variables or multiple paths leading from
C to E. This choice, too, is deliberate. The reason is that mediators are
sometimes latent variables and not directly measurable. When we admin- 
ister a medical drug C to cure headache E, there are numerous mediators 
in an appropriate causal model that includes C and E. However, the med- 
ical practitioner, who has to choose between different drugs, is mainly in- 
terested in the overall effect that C has on E (how often does the headache 
go away?), not in the details of the causal transmission within the human 
body. Therefore we keep the model simple and amalgamate the effects 
that C may have on E via different paths into one number (e.g., Dupré, 
1984; Eells, 1991). This omission does not rule out a path-specific perspec- 
tive. Investigating path-specific causal effect is an interesting topic and 
relevant for many cases of policy-making and attribution, but it is
orthogonal to our efforts. By appropriate conditionalization on other factors
in the causal model, any measure of causal effect can be used for calculating
path-specific effects and comparing them to the net effect (cf. Pearl,
2001).

While Formality sketches the ground on which the different measures 
compete, the following adequacy conditions describe how they should 
rank different cause/effect pairs. They describe different ways to think 
about measures of causal effect. 

We start with the case of comparing two putative causes of an effect 
E. Suppose for example that we ask what is a stronger cause of headache 
(E): thinking hard about a difficult research problem (C1) or going for a
night of binge drinking (C2)? In such cases, it is natural to answer that C1
is more effective than C2 if and only if C1 makes E more expected than C2.
In other words, a cause of an effect is stronger than another cause if it has a
higher likelihood of producing the effect. Such a requirement is analogous 
to Final Probability Incrementality in Bayesian confirmation theory (Crupi, 
2013; Crupi et al., 2013) that we have encountered in Variation 2. 


Effect Production 


η(C1,E) > η(C2,E) if and only if p(E|do(C1)) > p(E|do(C2))


It may be objected that Effect Production neglects the contrastive na- 
ture of measures of causal effect. They should measure the degree of 
causal dependence on C as opposed to ¬C, that is, the difference that an
intervention on C makes for E. This aspect gets lost for measures of causal
effect that satisfy Effect Production: what happens if ¬C1 or ¬C2 is the
case does not matter for calculating causal effect. 

If one follows this argument, one could replace Effect Production by 
an adequacy condition that focuses on the difference that two competing
causes make for the target effect: 


Difference-Making

η(C1,E) > η(C2,E)

if and only if for a function g : [0,1]² → ℝ which is monotonically
increasing in the first and monotonically decreasing in the second
argument:

g(p(E|do(C1)), p(E|do(¬C1))) > g(p(E|do(C2)), p(E|do(¬C2)))


This condition demands in particular that the base rates of C1 and C2
should not matter for ranking their causal effect for an effect E. Instead,
we only look at the degree to which intervening on C1 and C2 makes a
difference for E. The monotonicity constraint on g expresses the intuitive
condition that the more likely a cause C is to bring about an effect E, the
more sizeable its causal effect, all other things being equal, and vice versa
for ¬C.
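
Effect Production and Difference-Making can rank the same pair of causes in opposite ways. The following minimal sketch uses invented interventional probabilities and, as one admissible choice of g, the simple difference g(x, y) = x − y; both the numbers and the helper names are assumptions made for illustration.

```python
# Hypothetical interventional probabilities for two candidate causes of E.
CAUSES = {
    "C1": {"p_do": 0.50, "p_do_not": 0.45},   # E is likely with or without C1
    "C2": {"p_do": 0.40, "p_do_not": 0.05},   # E is rare unless C2 occurs
}


def effect_production_score(name):
    """Effect Production looks only at p(E | do(C))."""
    return CAUSES[name]["p_do"]


def difference_making_score(name):
    """One admissible g: increasing in p(E|do(C)), decreasing in p(E|do(not-C))."""
    return CAUSES[name]["p_do"] - CAUSES[name]["p_do_not"]


rank_by_production = sorted(CAUSES, key=effect_production_score, reverse=True)
rank_by_difference = sorted(CAUSES, key=difference_making_score, reverse=True)
```

Here C1 comes out on top under Effect Production, whereas C2 comes out on top under Difference-Making, because C1 barely changes the probability of E.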

Another important general property is a symmetry proposed by Fitelson
and Hitchcock (2011): the degree to which C prevents E (= the degree
to which C causes ¬E) is the negative of the degree to which C causes E:


Causation-Prevention Symmetry (CPS) 


−η(C,E) = η(C,¬E)
In other words, when C is a strong preventive cause of E, it is just a weak
cause of E, and vice versa. CPS is more than a purely ordinal constraint:
it assigns meaning to the exact numbers yielded by a measure of causal 
effect. Evidently, only measures of causal effect which take both positive 
and negative values can satisfy CPS, and we will often rescale candidate 
measures into a form that satisfies CPS. A purely ordinal version of CPS 
is the following, strictly weaker condition: 


Weak Causation-Prevention Symmetry (WCPS) For two effects E1 and
E2 which are screened off by a common cause C (i.e., E1 ⊥⊥ E2 given
C and ¬C),

η(C,E1) = η(C,E2) if and only if η(C,¬E1) = η(C,¬E2)


This condition demands that for two equally strong effects of a common 
cause, their negation is also prevented to an equal degree. 
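
A quick numerical check shows which candidate measures satisfy CPS. The sketch below compares the difference of interventional probabilities with their ratio; the function names and the probability values are illustrative assumptions, not part of the text.

```python
import math


def difference_measure(p1, p0):
    """p(E|do(C)) - p(E|do(not-C))."""
    return p1 - p0


def ratio_measure(p1, p0):
    """p(E|do(C)) / p(E|do(not-C))."""
    return p1 / p0


def for_negated_effect(p1, p0):
    """Interventional probabilities of not-E instead of E."""
    return 1 - p1, 1 - p0


p1, p0 = 0.7, 0.2   # illustrative values for p(E|do(C)) and p(E|do(not-C))

cps_holds_for_difference = math.isclose(
    difference_measure(*for_negated_effect(p1, p0)), -difference_measure(p1, p0)
)
cps_holds_for_ratio = math.isclose(
    ratio_measure(*for_negated_effect(p1, p0)), -ratio_measure(p1, p0)
)
```

The difference measure changes sign when E is replaced by ¬E, as CPS requires. The unscaled ratio takes only positive values and therefore cannot satisfy CPS, in line with the remark that only measures with both positive and negative values can.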

We now proceed to more specific adequacy conditions that characterize
an individual measure, or a class of measures that delivers the same
rankings of causal effect. This latter property is called ordinal equivalence of
measures; it is also familiar from Variation 2. Two measures η and η′ are
ordinally equivalent if and only if, for all cause-effect pairs,

η(C,E) > η(C′,E′) if and only if η′(C,E) > η′(C′,E′).


The point of the following sections is to bring out the characteristic prop- 
erties of the various available measures, in order to create a basis for com- 
paring, discussing and appraising them. In the end, we will also explain 
our personal preferences and draw some tentative conclusions regarding 
the question of whether we should work with a single causal effect mea- 
sure, or a plurality of measures. 


Causal Production and the Suppes-Pearl Measure 


In this subsection, we derive an axiomatic characterization of the Pearl-Suppes
measure η(C,E) = p(E|do(C)) (Suppes, 1970; Pearl, 2000). To
this end, we introduce a condition which is motivated by the intuition 
that causes produce their effects. Consider Table 6.3 which we already 
know from Variation 2. Three teams in the Italian Serie A, AS Roma, FC
Internazionale (“Inter”), and Juventus (“Juve”) are still competing for the
scudetto, the national soccer championship. On the penultimate match day,
Inter beats Juve in the Derby d'Italia while Roma loses to another team.
Call this conjunction of events C. Let E1 = Inter will win the championship
and E2 = Roma will be the runner-up. Given C, E1 and E2 are logically
equivalent. (Juve is four and five points behind Roma and Inter, respectively,
and cannot surpass them any more.) It is now very natural to claim that C
causes E1 and E2 to an equal degree.

        After 36 out of 38 rounds    After 37 out of 38 rounds
Rank    Team     Points              Team     Points
1       Roma     78                  Inter    79
2       Inter    76                  Roma     78
3       Juve     74                  Juve     74

Table 6.3: A motivating example for Conditional Equivalence. Top of the
Serie A after 36 and 37 out of 38 rounds, respectively.

This intuition is stated in the following condition:


Conditional Equivalence If E1 and E2 are logically equivalent given C,
then η(C,E1) = η(C,E2).


Taking this condition together with Formality and Effect Production, 
we can prove the following theorem: 


Theorem 6.1 (Representation Theorem for η_sp) All measures of causal effect
that satisfy Formality, Effect Production and Conditional Equivalence are
ordinally equivalent to

η_sp(C,E) = p(E|do(C))


Pearl (2000, 70) calls η_sp(C,E) = p(E|do(C)) the “causal effect” of C on
E. This measure fits quite well with cases of causal production where we 
are asked to rank causes of an event according to the degree that they pro- 
duced E or were responsible for E. For instance, should a car accident (E) 
be attributed to driving a bit too fast (C1) or to ignoring a red traffic light 
(C2)? Although both causes describe the violation of a norm, one of them 
has a much higher tendency to cause an accident, and p(E|do(C)) seems 
to be a good guide for ranking the causes. This position is also defended 
in two recent papers that apply causal relevance to liability and legal
reasoning (Kaiserman, 2016a,b). Note, however, that η_sp violates
Difference-Making, and it does not distinguish between (positive) causation,
causal prevention, and causal irrelevance.

We now proceed to representation theorems for measures which violate
Effect Production and satisfy Difference-Making. Given Formality
and Difference-Making, each of the adequacy conditions discussed below 
is sufficient to single out a measure of causal effect up to ordinal equiv- 
alence. In this specific sense, those conditions are therefore incompatible 
with each other. 


The Multiplicativity Principle and the Difference Mea- 
sure 


How should causal effect combine on a single path, e.g., in the graph in 
Figure 6.2? If causal effect is conceived of as the intensity of a physical 
process linking cause and effect, overall causal effect should be a function 
of the causal effect between the individual links. But which function g :
ℝ² → ℝ should be chosen such that η(C,E) = g(η(C,X), η(X,E))?

Figure 6.2: A DAG representing causation along a single path.


A couple of requirements suggest themselves. First of all, g should be 
symmetric: the order of mediators in a chain does not matter. Whether 
a weak link precedes a strong link, or vice versa, should not matter for 
overall causal effect. Second, it seems that the overall causal effect cannot 
be stronger than the weakest link in the chain: If C and X are almost 
independent, it does not matter how strongly X and E are correlated: the 
causal effect will still be weak. Similarly, if both links are weak, the overall
link will be even weaker. On the other hand, if one link is maximally
strong (e.g., η(C,X) = 1), then the strength of the entire chain will just be
the strength of the rest of the chain. See also Good (1961a, 311-312). 

A very simple function that satisfies all these requirements is multipli- 
cation. Thus, we obtain the following principle: 


Multiplicativity along Single Paths If the variables C and E are connected
via a single path with intermediate node X, then η(C,E) =
η(C,X) · η(X,E).

As a corollary, we obtain that for a causal chain with multiple mediators,
e.g., C → X1 → ... → Xn → E,

η(C,E) = η(C,X1) · η(X1,X2) · ... · η(Xn−1,Xn) · η(Xn,E)


It is now possible to characterize all measures that satisfy Multiplicativity
along Single Paths, in addition to Formality and Difference-Making:


Theorem 6.2 (Representation Theorem for η_d) All measures of causal effect
that satisfy Formality, Difference-Making and Multiplicativity along Single Paths
are ordinally equivalent to

η_d(C,E) = p(E|do(C)) − p(E|do(¬C))


This is a simple and intuitive quantity that measures the causal effect of 
C for E by comparing the effect that different interventions on C have on E. 
It possesses the sine qua non property that two effects in a conjunctive fork 
(e.g., E1 ← C → E2) do not cause each other. It is also straightforwardly
applicable in statistical inference where it is used to quantify effect size 
for categorical variables under an intervention on C. In clinical trials and 
epidemiological studies, η_d(C,E) is identical to Absolute Risk Reduction,
or ARR. 
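
That the difference measure multiplies along a single path can be verified numerically. The sketch below assumes a chain C → X → E without confounding, so that p(E|do(X)) coincides with p(E|X) and p(E|do(C)) is obtained by averaging over X. All parameter values and names are hypothetical.

```python
def p_e_do_c(p_x_do_c, p_e_given_x, p_e_given_not_x):
    """p(E | do(C)) in the chain C -> X -> E: average over the mediator X."""
    return p_x_do_c * p_e_given_x + (1 - p_x_do_c) * p_e_given_not_x


# Illustrative parameters of the chain.
p_x_do_c, p_x_do_not_c = 0.9, 0.3          # p(X | do(C)), p(X | do(not-C))
p_e_given_x, p_e_given_not_x = 0.8, 0.2    # p(E | X), p(E | not-X)

eta_c_x = p_x_do_c - p_x_do_not_c          # difference measure for C -> X
eta_x_e = p_e_given_x - p_e_given_not_x    # difference measure for X -> E
eta_c_e = (
    p_e_do_c(p_x_do_c, p_e_given_x, p_e_given_not_x)
    - p_e_do_c(p_x_do_not_c, p_e_given_x, p_e_given_not_x)
)                                          # difference measure for C -> E
```

Algebraically, the overall difference equals (p(X|do(C)) − p(X|do(¬C))) · (p(E|X) − p(E|¬X)), so `eta_c_e` agrees with `eta_c_x * eta_x_e` for any choice of parameters, not just the ones used here.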

We also state two notable properties of η_d. First, it can be rewritten as

η_d(C,E) = p(E|do(C)) − p(E|do(¬C))
         = p(¬E|do(¬C)) + p(E|do(C)) − 1


Modulo subtraction of a constant, η_d(C,E) is a sum of two
quantities that have been called causal/explanatory necessity and
causal/explanatory sufficiency by Hempel (1965) and Pearl (2000). The
names are natural: p(¬E|do(¬C), C, E) indicates to what extent C was
necessary for producing E (in a world where C and E are present, what would
have happened if C had not occurred?), and p(E|do(C), ¬C, ¬E) indicates
to what extent the presence of C was sufficient for producing E (in a world
where C and E are absent, what would have happened if C had occurred?).
η_d(C,E) combines these two plausible ways of thinking about causal effect
in an intuitive manner. 

While this property may be regarded as superficial, the following one 
is more profound. Consider the proposition E1 that a certain real-valued
quantity E falls into the interval [e1, e1′] and the proposition E2 that E
has values in [e2, e2′]. Obviously, these two propositions are mutually
exclusive if the intervals are disjoint. But what is the degree to which C
causes E1 or E2 (that is, E ∈ [e1, e1′] ∪ [e2, e2′])? This question can be answered in
general: 


Corollary 6.1 For C, E1 and E2 ∈ L,

η_d(C, E1 ∨ E2) = η_d(C, E1) + η_d(C, E2) − η_d(C, E1 ∧ E2).    (6.2)

In particular, if E1 and E2 are mutually exclusive (that is, if ¬(E1 ∧ E2) is a
theorem), then the above equation reduces to

η_d(C, E1 ∨ E2) = η_d(C, E1) + η_d(C, E2)

and we can also formulate the following necessary and sufficient condition on
rankings of causal effect according to η_d(C,E):

η_d(C, E1 ∨ E2) > η_d(C, E1) if and only if η_d(C, E2) > 0,
and vice versa with E1 and E2 reversed.


The proof is straightforward and left as an exercise. This means that 
the degree to which a mutually exclusive disjunction of effects is caused is 
the sum of the individual degrees of causation. In particular, causal effect 
is enlarged by disjunctively tacking on further effects if and only if each of
these effects is itself caused to a positive degree. 
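
The additivity property can be checked on a small example. The sketch below assumes a three-valued outcome with invented distributions under do(C) and do(¬C) and represents each effect as a set of outcome values; the variable and function names are hypothetical.

```python
# Hypothetical distributions of a three-valued outcome under the two interventions.
P_DO_C = {"a": 0.5, "b": 0.3, "c": 0.2}
P_DO_NOT_C = {"a": 0.2, "b": 0.4, "c": 0.4}


def eta_d(event):
    """Difference measure for an effect given as a set of outcome values."""
    return sum(P_DO_C[v] for v in event) - sum(P_DO_NOT_C[v] for v in event)


# General case: overlapping effects, inclusion-exclusion form of the corollary.
E1, E2 = {"a", "b"}, {"b", "c"}
lhs_general = eta_d(E1 | E2)
rhs_general = eta_d(E1) + eta_d(E2) - eta_d(E1 & E2)

# Special case: mutually exclusive effects, plain additivity.
F1, F2 = {"a"}, {"b"}
lhs_exclusive = eta_d(F1 | F2)
rhs_exclusive = eta_d(F1) + eta_d(F2)

# Tacking F2 onto F1 enlarges the causal effect only if F2 is itself positively caused.
tacking_enlarges = lhs_exclusive > eta_d(F1)
f2_positively_caused = eta_d(F2) > 0
```

In this example F2 is caused to a negative degree, so adding it to F1 lowers the overall causal effect, as the ranking condition of the corollary predicts.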

This corollary has an interesting implication for causal inference with 
multicategorial variables, such as “place of residence” or “preferred travel 
destination”. Because such variables cannot be measured on a metric scale, 
they are not easy to use in statistical inference. Often, a multicategorial 
variable E ∈ {e1, ..., en} is encoded by a series of dummy variables, such
as E1 := (E = e1), E2 := (E = e2), etc. By describing the causal effect of C on a
disjunction of several dummy variables Ei ∨ Ej ∨ Ek ∨ ..., Corollary
6.1 specifies the effect of C on a range of values of a multicategorial variable
in terms of the effect that it has on the dummy variables E1, ..., En.


The No Dilution for Irrelevant Effects Principle and 
Probability Ratio Measures 


What is the causal effect of C on the conjunction of two effects, E1 ∧ E2,
when C affects only one of them, and the other effect (say, E2) is independent
of C and of E1? In such circumstances, we may call E2 an “irrelevant
effect”. This situation is represented visually in the DAG of Figure 6.3.

Figure 6.3: An effect E2 which is irrelevant regarding the causal relation
between C and E1.


There are two basic intuitions about what such effects mean for overall 
causal effect: either the causal effect of C is diluted when passing from E1
to E1 ∧ E2, or it is not. Dilution means that adding E2 to E1 diminishes
the causal effect of C, that is, η(C, E1 ∧ E2) < η(C, E1). A measure is
non-diluting if in these circumstances, η(C, E1 ∧ E2) = η(C, E1). This amounts
to the following principle: 


No Dilution for Irrelevant Effects If E2 ⊥⊥ C, and E2 ⊥⊥ E1 conditional on
C and ¬C, then η(C, E1 ∧ E2) = η(C, E1).


Non-diluting measures of causal effect that satisfy Difference-Making 
can be neatly characterized. In fact, they are all ordinally equivalent to 
Lewis’ probability ratio measure (Lewis, 1986), as the following theorem
demonstrates.


Theorem 6.3 (Representation Theorem for η_r) All measures of causal effect
that satisfy Formality, Difference-Making, and No Dilution for Irrelevant Effects
are ordinally equivalent to

η_r(C,E) = p(E|do(C)) / p(E|do(¬C))

and its rescaling to the [−1, 1] range

η_r′(C,E) = [p(E|do(C)) − p(E|do(¬C))] / [p(E|do(C)) + p(E|do(¬C))].


To some extent, this result can be interpreted as a reductio of the proba- 
bility ratio measure, and the class of measures that are ordinally equivalent 
to it. After all, given the lack of a causal connection between C and E2, it is
plausible that C causes E1 ∧ E2 to a smaller degree than E1. The probability
ratio measure η_r(C,E), however, satisfies the Principle of No Dilution
for Irrelevant Effects. In particular, since the probability ratio measure is 
just the Relative Risk measure commonly used in epidemiology, the above 
arguments undermine the use of that measure in clinical practice, too. 
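
The contrast between diluting and non-diluting measures can be made concrete with a minimal sketch. It assumes an irrelevant effect E2 that is independent of C and of E1, so that the probability of the conjunction factorizes under both interventions; the numbers and names are invented for illustration.

```python
import math

# Illustrative interventional probabilities.
p_e1_do_c, p_e1_do_not_c = 0.6, 0.2   # p(E1 | do(C)), p(E1 | do(not-C))
p_e2 = 0.5                            # p(E2); unaffected by C and independent of E1

# Independence lets the conjunction factorize under both interventions.
p_conj_do_c = p_e1_do_c * p_e2
p_conj_do_not_c = p_e1_do_not_c * p_e2

ratio_single = p_e1_do_c / p_e1_do_not_c
ratio_conj = p_conj_do_c / p_conj_do_not_c          # p(E2) cancels out

difference_single = p_e1_do_c - p_e1_do_not_c
difference_conj = p_conj_do_c - p_conj_do_not_c     # shrinks by the factor p(E2)

ratio_is_non_diluting = math.isclose(ratio_conj, ratio_single)
difference_is_diluted = difference_conj < difference_single
```

Because p(E2) cancels in the ratio, the probability ratio is unaffected by the irrelevant conjunct, while the difference of interventional probabilities is scaled down by p(E2).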

A way around this problem consists in restricting No Dilution for Ir- 
relevant Effects to prevention rather than (positive) causation. According 
to this reading, if C is a strong preventive cause of E1, the degree of
prevention is not diluted by adding irrelevant effects. This may be a bit more
intuitive than the above principle. After all, lowering the degree of prevention
can be read as increasing causal effect, and why should we be able to achieve
this “for free” just by adding irrelevant effects? Formally, this condition
reads: 


No Dilution for Irrelevant Effects (Prevention) Let C be a preventive
cause of E1. If E2 ⊥⊥ C, and E2 ⊥⊥ E1 conditional on C and ¬C, then
η(C, E1 ∧ E2) = η(C, E1).


The adequacy of No Dilution for Irrelevant Effects is a question that will 
return in Variation 7, when we discuss the principle of Explanatory Justice 
(Crupi and Tentori, 2012; Cohen, 2016a). 

If we combine this restricted version of No Dilution for Irrelevant Ef- 
fects with Weak Causation-Prevention Symmetry, we get an interesting 
result: 


Theorem 6.4 (Representation Theorem for η_cg) All measures of causal
strength that satisfy Formality, Difference-Making, No Dilution for Irrele- 
vant Effects (Prevention) and Causation-Prevention Symmetry are ordinally 
equivalent to 


if C is a positive cause of E 


if C is a preventive cause of E 


This measure agrees, for the case of positive causation, with two proposals
from the literature. The psychologist Patricia Cheng (1997) derived her measure
from theoretical considerations about how people perform causal induction
and called it the causal power of C on E. I. J. Good (1961a,b) derived a
measure that is ordinally equivalent to Cheng’s from a complex set of theoretical
adequacy conditions. Here is Good’s rescaling:


int 

_ — Piao 

Ig(CE) = 4 ce\d0(c)) 
p(E|do(=C)) 


if C is a positive cause of E 


if C is a preventive cause of E 


The two previous theorems elucidate that the measures characterized in Theorems 6.3 and 6.4 are based on the same
principle: the No Dilution for Irrelevant Effects property. The crucial ques- 
tion which separates the two measures is whether this property should 
hold across the board or just for preventive causes. 


Conjunctive Closure and the Logarithmic Ratio Mea- 
sure 


Consider a variable C that affects a set of other variables E1, E2, ..., which
would be unrelated to each other, were it not for their common cause C.
In this scenario, represented visually in Figure 6.4, one could ask how the
causal effect of C on each individual effect (E1, E2) affects the causal effect
that C exerts on the conjunction of these variables. In other words, we
ask how η(C, E1 ∧ E2) depends on η(C,E1) and η(C,E2), and under which
circumstances the former is a function of the latter.


Figure 6.4: A typical common cause structure where C screens off the two
effects E1 and E2.


A plausible principle for characterizing this dependency is stated be- 
low. It is analogous to the conjunction principle in epistemology, which 
states that justification and/or knowledge is closed under logical conjunc- 
tion. Shogenji (2012) transfers this principle to quantitative measures of 
justification: when (i) the degrees of justification of H1 and H2 given E are
both equal to t and (ii) H1 and H2 are probabilistically independent
(unconditionally and conditionally on E), then the degree of justification of
H1 ∧ H2 should also be equal to t. Put differently, justification is not
diluted under the conjunction of independent propositions. Shogenji (2012)
calls this the Special Conjunction Principle. 

Transferred to causal effect, this would mean that the causal effect of 
C on E1 ∧ E2 equals the causal effect of C on either E1 or E2 if E1 ⊥⊥ E2,
conditional on C and ¬C. In other words, causal effect is closed under the
conjunction of independent effects. Formally: 


Conjunctive Closure If E1 ⊥⊥ E2 given C and ¬C and η(C,E1) =
η(C,E2) = t, then also η(C, E1 ∧ E2) = t.


This principle is plausible for doxastic justification and it is appealing 
for causal effect as well. Imagine, for example, that a medical drug has 
two side effects—diarrhea and sore throat—which are independent of each 
other, and that both side effects are caused with the same strength t. It is 
then natural to say that the overall side effect of the medical drug is also 
equal to t since there is no interaction between both effects. Apart from the
intuitive plausibility, this principle facilitates practical calculations because 
we can often infer the strength of a complex causal effect from the strength 
of the individual effects. 

It is possible to characterize measures which satisfy Conjunctive Clo- 
sure up to ordinal equivalence, similar to how Atkinson (2012) described 
justification measures that satisfy the Special Conjunction Principle. In 
fact, our theorem and proof follow Atkinson’s example quite closely.


Theorem 6.5 (Representation Theorem for η_lr) All measures of causal effect
that satisfy Formality, Difference-Making and Conjunctive Closure are ordinally
equivalent to

η_lr(C,E) = log p(E|do(¬C)) / log p(E|do(C))


We call this measure the Logarithmic Ratio measure since it is based on 
the ratio of logarithms of p(E|do(C)) and p(E|do(¬C)), rather than on the
ratio of probabilities, such as in the Lewis measure. Although this measure 
has not yet been proposed in the literature, it is a serious candidate for a 
measure of causal effect and deserves our attention. 
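
Conjunctive Closure can be checked numerically for a ratio of logarithms. The sketch below takes the measure in the orientation log p(E|do(¬C)) / log p(E|do(C)), which is an assumption made here because this orientation increases with p(E|do(C)) as Difference-Making requires; the closure property itself holds for either orientation. The probabilities are chosen so that both effects are caused to the same degree t = 2.

```python
import math


def log_ratio_measure(p_do_c, p_do_not_c):
    """Ratio of logarithms; assumed orientation: log p(E|do(not-C)) / log p(E|do(C))."""
    return math.log(p_do_not_c) / math.log(p_do_c)


# Two effects that are independent given C and given not-C (illustrative values).
e1 = (0.5, 0.25)    # p(E1 | do(C)), p(E1 | do(not-C)); note 0.25 = 0.5 ** 2
e2 = (0.9, 0.81)    # p(E2 | do(C)), p(E2 | do(not-C)); note 0.81 = 0.9 ** 2

# Independence lets the conjunction factorize under both interventions.
conjunction = (e1[0] * e2[0], e1[1] * e2[1])

t1 = log_ratio_measure(*e1)
t2 = log_ratio_measure(*e2)
t_conjunction = log_ratio_measure(*conjunction)
```

Since the logarithm turns the product of independent probabilities into a sum, numerator and denominator both scale in the same proportion, and the conjunction inherits the common value t.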


Application: Quantifying Causal Effect in Medicine 


A natural scientific application of probabilistic measures of causal effect 
consists in quantifying the size of an effect that an intervention on one 
variable has on another. As we have already said when discussing Table
6.1, there are several ways of measuring effect size that are employed in the 
medical literature. They can be related straightforwardly to probabilistic 
measures of causal effect when we write the relative frequencies of an 
event as probabilities (e.g., A/(A+B) is just the frequency of E among all 
C’s). In particular, they read as follows: 


RR  = p(E|do(C)) / p(E|do(¬C))                                   (Relative Risk)
OR  = [p(E|do(C)) / p(¬E|do(C))] / [p(E|do(¬C)) / p(¬E|do(¬C))]  (Odds Ratio)
ARR = p(E|do(C)) − p(E|do(¬C))                                   (Absolute Risk Reduction)


It is not difficult to relate these measures to the measures we discussed. 
For example, RR is just the familiar probability ratio measure $\eta_r$, whereas 
ARR turns out to be the difference measure $\eta_d$. OR is the product of the 
probability ratio measure $\eta_r$ and Good’s measure $\eta_g$. Normative 
arguments for or against causal effect measures also carry over to effect size 
measures. For example, Multiplicativity along Single Paths (the defining 
property of $\eta_d$) sounds very reasonable in the context of medical 
inference, whereas the No Dilution for Irrelevant Effects property (the 
defining property of $\eta_r$) is apparently problematic. Our results may thus be 
used for constructing an argument for preferring the Absolute Risk Reduction 
measure ARR over its more popular competitor RR, the Relative Risk 
measure. Our theoretical arguments square nicely with decision-theoretic 
and epistemic arguments for preferring absolute over relative measures of 
risk reduction in medicine, e.g., the neglect of prior probabilities in relative 
measures (Stegenga, 2015; Sprenger and Stegenga, 2016). Although we do not 
pursue this topic in detail here (it would deserve a separate paper), it should 
be evident that our analysis of causal effect has important applications in 
scientific inference, and in medical science in particular. 
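
To make the correspondence concrete, the following Python sketch computes the three effect size measures from a 2×2 table of counts and checks that OR is the product of the probability ratio measure and Good's measure. It is a minimal illustration under assumptions of our own: the cell labels `a`, `b`, `c`, `d` (with `c` and `d` denoting the counts of E and ¬E among the ¬C's), the function name `effect_sizes` and the example counts are hypothetical, and relative frequencies are read as interventional probabilities, as in a randomized trial.

```python
from fractions import Fraction


def effect_sizes(a, b, c, d):
    """Effect sizes from a 2x2 table of counts (hypothetical cell labels).

    a: C and E,      b: C and not-E,
    c: not-C and E,  d: not-C and not-E.
    Frequencies are treated as p(E|do(C)) and p(E|do(not-C)); this
    presupposes a randomized design without confounding.
    """
    p_e_do_c = Fraction(a, a + b)        # frequency of E among all C's
    p_e_do_not_c = Fraction(c, c + d)    # frequency of E among all not-C's

    rr = p_e_do_c / p_e_do_not_c                      # probability ratio measure
    odds_c = p_e_do_c / (1 - p_e_do_c)
    odds_not_c = p_e_do_not_c / (1 - p_e_do_not_c)
    odds_ratio = odds_c / odds_not_c
    arr = p_e_do_c - p_e_do_not_c                     # difference measure
    good = (1 - p_e_do_not_c) / (1 - p_e_do_c)        # Good's measure (ratio form)
    return {"RR": rr, "OR": odds_ratio, "ARR": arr, "Good": good}


# Hypothetical trial: 30 of 100 treated and 10 of 100 untreated units show E.
sizes = effect_sizes(30, 70, 10, 90)
print(sizes)
print(sizes["RR"] * sizes["Good"] == sizes["OR"])
```

Exact fractions are used instead of floating-point numbers so that the identity OR = RR × Good's measure can be checked by equality rather than up to rounding.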


Discussion 


This variation has provided axiomatic foundations for a probabilistic 
theory of causal effect, proceeding toward a more systematic investigation 
of that topic. It synthesizes ideas from the manipulability/interventionist 
view of causation and the probabilistic relevance view of causation. While 
causal Bayes nets provide the framework for our analysis, the methods 
for characterizing the various measures are transferred from various parts 
of formal epistemology, in particular Bayesian confirmation theory and 
Bayesian analyses of explanatory power. 


After introducing the conceptual and mathematical framework, we 
have noted that intuitions about measures of causal effect pull in 
different directions. This makes it difficult to come up with “the one true 
measure of causal effect”, in analogy to what has been tried in 
confirmation theory (Milne, 1996). However, this does not render the project 
futile. On the contrary, characterizing each measure by a combination of 
adequacy conditions makes it possible to assess the (possibly 
context-sensitive) value of the different measures by assessing the 
plausibility of those adequacy conditions. Even if more than a single measure 
survives theoretical scrutiny, one can still form informed preferences. 
Below we make a case for the difference measure $\eta_d$. 


Notably, the measures which we investigated fall into two major 
categories: those that do and those that do not satisfy the Difference-Making 
property (i.e., causal effect is a function of p(E|do(C)) and p(E|do(¬C)), 
increasing in the first and decreasing in the second argument). Only the first 
measure in our list, the Suppes-Pearl measure $\eta_{sp}(C,E) = p(E \mid \mathrm{do}(C))$, 
fails to satisfy this condition because it does not depend on p(E|do(¬C)), 
that is, on the contrastive value that a cause has for an effect. However, 
it may be suitable for quantifying degree of causal production in cases 
of actual causation, when we are more interested in questions of 
attribution and liability than in the predictive value of a cause for an effect 
(e.g., Kaiserman, 2016a,b). The other measures are more straightforward 
expressions of counterfactual dependence: how much does a change in the 
value of C affect the outcome E? See also Beckers and Vennekens (2016) 
for the role of production and dependence in judgments of causation. 


The properties of the investigated measures are summarized in Table 
6.4. It is notable that only two measures ($\eta_d$ and $\eta_c$) satisfy Weak 
Causation-Prevention Symmetry, although this is an eminently sensible 
property. The same can be said about Multiplicativity along Single Paths, 
which is only satisfied by $\eta_d$. One should add, however, that this property 
is significantly stronger since it also varies among measures in one and the 
same ordinal equivalence class. On the other hand, by characterizing the 
$\eta_r$-, $\eta_g$- and $\eta_c$-measures mathematically, Theorems 6.3 and 6.4 also 
point out problems with rival measures of causal effect (namely, that they 
satisfy the questionable No Dilution principle). 

                                          Property 
Measure                          FORM  EP   DM   WCPS  CE   MUL  NDIE  NDIEP  CC 
Suppes/Pearl ($\eta_{sp}$)       yes   yes  no   no    yes  no   no    no     no 
Eells ($\eta_d$)                 yes   no   yes  yes   no   yes  no    no     no 
Lewis ($\eta_r$, $\eta_{r'}$)    yes   no   yes  no    no   no   yes   yes    no 
Cheng/Good ($\eta_g$, $\eta_c$)  yes   no   yes  yes   no   no   no    yes    no 
Log-Ratio ($\eta_{cc}$)          yes   no   yes  no    no   no   no    no     yes 

Table 6.4: A classification of different measures of causal effect 
according to the adequacy conditions that they satisfy. FORM = Formality, EP 
= Effect Production, DM = Difference-Making, WCPS = Weak 
Causation-Prevention Symmetry, CE = Conditional Equivalence, MUL = 
Multiplicativity along Single Paths, NDIE = No Dilution for Irrelevant Effects, 
NDIEP = No Dilution for Irrelevant Effects (Prevention), CC = Conjunctive 
Closure. 

All in all, the above analysis provides good grounds for using $\eta_d$ as a 
default measure of causal effect. Indeed, Pearl (2001) bases his 
quantification of path-specific effects on $\eta_d$ as the basic measure of causal 
effect, without justifying this choice further. We are closing this gap. The 
formal analysis also mirrors and supports practice- and decision-oriented 
arguments for $\eta_d$ vis-à-vis its competitors, e.g., in medical science 
(Stegenga, 2015; Sprenger and Stegenga, 2016). 

What remains to be done? First of all, we may aim at generalizing the 
framework from binary variables to categorical and real-valued variables. 
Indeed, many measures of effect size for real-valued variables, such as 
Cohen’s d or Glass’s Δ, are based on the difference of group means, and $\eta_d$ 
might be extended naturally in this direction. As long as the cause is a 
binary variable, that is, as long as only two different values of C are 
compared, our analysis holds water. The same calculations still apply even if E 
is a real-valued variable. Our approach may thus go further toward 
modeling scientific inference about causal effect than our earlier restriction 
to binary variables may suggest. 
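
The suggested extension can be sketched in a few lines of Python. For a binary effect coded as 0/1, the difference of group means coincides with the difference of relative frequencies, so the same function also serves as a mean-difference effect size for real-valued outcomes. The function names, the pooled-variance form of Cohen's d and all data below are assumptions chosen for illustration; they are not taken from the text.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(treated, control):
    """Difference of group means; for 0/1 outcomes this is the
    difference of relative frequencies of E in the two groups."""
    return mean(treated) - mean(control)


def cohens_d(treated, control):
    """Mean difference standardized by the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return mean_difference(treated, control) / sqrt(pooled_var)


# Binary effect: E occurs in 3 of 4 units under C and in 1 of 4 under not-C.
binary_treated = [1, 1, 1, 0]
binary_control = [1, 0, 0, 0]
print(mean_difference(binary_treated, binary_control))

# Real-valued effect with hypothetical measurements.
real_treated = [5.0, 6.0, 7.0]
real_control = [3.0, 4.0, 5.0]
print(mean_difference(real_treated, real_control))
print(cohens_d(real_treated, real_control))
```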

Second, the properties of the above measures, and in particular $\eta_d$, in 
complicated networks (e.g., more than one path linking C and E) have 
not been investigated. Is it possible to show, for example, how degrees of 
causation along different paths can be combined in an overall assessment 
of causal effect, e.g., similar to Theorem 3 in Pearl (2001)? 


Third, this work can be connected to information-theoretic approaches 
to causal specificity (Weber, 2006; Waters, 2007; Korb et al., 2011; Griffiths 
et al., 2015). The narrower the range of effects that an intervention is 
likely to produce, the more specific the cause is to the effect. How does 
this concept relate to causal effect, and to what extent can both research 
programs learn from each other? 

Fourth, we would like to apply this theory to canonical examples in the 
causation literature and to explore whether our understanding of causal 
effect squares well with the significance of normality and norms in causal 
reasoning (Knobe and Fraser, 2008; Hitchcock and Knobe, 2009). 

These are all open and exciting questions, and it is not difficult to come 
up with others. We hope, however, that the results presented herein are 
promising enough to motivate a further pursuit of an axiomatic theory of 
causal effect. 
Proofs of the Theorems 


Proof of Theorem 6.1: The proof relies on a recent result by Michael 
Schippers (2016) in the field of confirmation theory. Schippers demonstrates that 
the following three conditions are necessary and sufficient to characterize 
the posterior probability c*(E,H) := p(H|E) as a measure of degree of 
confirmation, up to ordinal equivalence. 


Formality (Confirmation) There is a measurable function $f' : [0,1]^3 \to \mathbb{R}$ 
such that for any $H, E \in \mathcal{L}$ with probability distribution $p(\cdot)$, 

$$c(E,H) = f'\big(p(E),\, p(H),\, p(H \wedge E)\big).$$

Final Probability Incrementality For any sentences $H$, $E_1$ and $E_2 \in \mathcal{L}$ 
with probability measure $p(\cdot)$, 

$$c(E_1,H) > c(E_2,H) \quad \text{if and only if} \quad p(H \mid E_1) > p(H \mid E_2).$$

Local Equivalence If $H_1$ and $H_2$ are logically equivalent given $E$, then 
$c(E,H_1) = c(E,H_2)$. 


It is easy to see that Final Probability Incrementality translates into Effect 
Production when the pair $(H, E_{1,2})$ is mapped to $(E, C_{1,2})$: 

$$\eta(C_1,E) > \eta(C_2,E) \quad \text{if and only if} \quad p(E \mid C_1) > p(E \mid C_2).$$

The same is true for Local Equivalence: with $(H_{1,2}, E)$ mapped to $(E_{1,2}, C)$, 
it postulates that if $E_1$ and $E_2$ are logically equivalent given $C$, then 
$\eta(C,E_1) = \eta(C,E_2)$. This is just the same as Conditional Equivalence. 

Thus it remains to show that Formality (Causal Effect) can be 
transformed into Formality (Confirmation) by a suitable change of variables. 
We already know that there exists an $f : [0,1]^3 \to \mathbb{R}$ such that 
$\eta(C,E) = f\big(p(C),\, p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big)$. 
Since we only want to characterize $f$ mathematically, we restrict ourselves 
to the case where E is among the descendants of C and they share no common 
causes. We also assume that $p(C) \in (0,1)$. This allows us to write 

$$p(E \wedge C) = p(C)\, p(E \mid \mathrm{do}(C)), \qquad p(E) = p(C)\, p(E \mid \mathrm{do}(C)) + (1 - p(C))\, p(E \mid \mathrm{do}(\neg C)),$$


which we can transform into the equations 

$$p(E \mid \mathrm{do}(C)) = \frac{p(E \wedge C)}{p(C)}, \qquad p(E \mid \mathrm{do}(\neg C)) = \frac{p(E) - p(C)\, p(E \mid \mathrm{do}(C))}{1 - p(C)}. \tag{6.3}$$

Hence, we can write p(E|do(C)) and p(E|do(¬C)) as functions of p(C), 
p(E) and p(E∧C). In other words, there is a function f’(p(C), p(E), p(C∧E)) 
that characterizes $\eta(C,E)$, namely 

$$\begin{aligned}
f'\big(p(C), p(E), p(C \wedge E)\big) &= f\left(p(C),\; \frac{p(E \wedge C)}{p(C)},\; \frac{p(E) - p(C)\, p(E \mid \mathrm{do}(C))}{1 - p(C)}\right) \\
&= f\big(p(C),\, p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big) \\
&= \eta(C,E).
\end{aligned}$$


$f'$ is continuous because $f$ and the functions in Equation (6.3) are. Thus 
we can extend $f'$ canonically to the set $\{p(C) \in \{0,1\}\}$. Hence we can 
invoke Schippers’ theorem, which shows that $\eta(C,E) = p(E \mid C)$ up to 
ordinal equivalence. 
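
The change of variables behind Equation (6.3) can be checked numerically. The Python sketch below maps the causal parameters to p(E∧C) and p(E) and back again, under the same restrictions as in the proof (E is a descendant of C, there are no common causes, and 0 < p(C) < 1). The function names and the numerical values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def joint_from_causal(p_c, p_e_do_c, p_e_do_not_c):
    """Return (p(E and C), p(E)) from p(C), p(E|do(C)), p(E|do(not-C)).

    Assumes no common causes of C and E, so that interventional and
    conditional probabilities of E coincide.
    """
    p_e_and_c = p_c * p_e_do_c
    p_e = p_c * p_e_do_c + (1 - p_c) * p_e_do_not_c
    return p_e_and_c, p_e


def causal_from_joint(p_c, p_e, p_e_and_c):
    """Invert the map above, following Equation (6.3)."""
    if not 0 < p_c < 1:
        raise ValueError("Equation (6.3) requires 0 < p(C) < 1")
    p_e_do_c = p_e_and_c / p_c
    p_e_do_not_c = (p_e - p_c * p_e_do_c) / (1 - p_c)
    return p_e_do_c, p_e_do_not_c


# Hypothetical values: p(C) = 2/5, p(E|do(C)) = 3/4, p(E|do(not-C)) = 1/4.
p_c = Fraction(2, 5)
p_e_and_c, p_e = joint_from_causal(p_c, Fraction(3, 4), Fraction(1, 4))
recovered = causal_from_joint(p_c, p_e, p_e_and_c)
print(p_e_and_c, p_e, recovered)
```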


Proof of Theorem 6.2: The proof of this representation theorem 
proceeds in several steps. First, we will show that 
$\eta(C,E) = f\big(p(C),\, p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big)$ 
does not depend on $p(C)$. 

Figure 6.5: A classical collider/joint effect structure in a causal net. 


The proof of this first claim proceeds by contradiction. Suppose that 
there are real numbers $x_1, x_2, y, z \in [0,1]$ such that $f(x_1,y,z) \neq f(x_2,y,z)$. 
Then choose $E$, $C_1$ and $C_2$ such that $E$ is a joint effect of $C_1$ and $C_2$ with 
$x_1 = p(C_1)$, $x_2 = p(C_2)$, 
$y = p(E \mid \mathrm{do}(C_1)) = p(E \mid \mathrm{do}(C_2))$ and 
$z = p(E \mid \mathrm{do}(\neg C_1)) = p(E \mid \mathrm{do}(\neg C_2))$. 
In this case, Difference-Making tells us that $\eta(C_1,E) = \eta(C_2,E)$. 
However, on the other hand, we also know 

$$\eta(C_1,E) = f(x_1,y,z) \neq f(x_2,y,z) = \eta(C_2,E).$$

This leads to a straightforward contradiction. Hence, from now on 
we focus on the function $g : [0,1]^2 \to \mathbb{R}$ such that 
$\eta(C,E) = g\big(p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big)$. 

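The second step of the proof consists in deriving the equality 

$$g(\alpha, \bar{\alpha}) \cdot g(\beta, \bar{\beta}) = g\big(\alpha\beta + (1-\alpha)\bar{\beta},\; \bar{\alpha}\beta + (1-\bar{\alpha})\bar{\beta}\big). \tag{6.4}$$

To this end, recall the Bayesian network from the main text, reproduced 
in Figure 6.6. Again, for the purpose of investigating the formal 
properties of $g$, we can focus on those cases where the conditional 
probabilities $p(E \mid \pm C)$ and the interventional probabilities 
$p(E \mid \mathrm{do}(\pm C))$ agree. 

Figure 6.6: The Bayesian Network for causation along a single path. 

We know by Multiplicativity along Single Paths that 

$$\begin{aligned}
\eta(C,E) &= \eta(C,X) \cdot \eta(X,E) \\
&= g\big(p(X \mid \mathrm{do}(C)),\, p(X \mid \mathrm{do}(\neg C))\big) \cdot g\big(p(E \mid \mathrm{do}(X)),\, p(E \mid \mathrm{do}(\neg X))\big) \\
&= g\big(p(X \mid C),\, p(X \mid \neg C)\big) \cdot g\big(p(E \mid X),\, p(E \mid \neg X)\big)
\end{aligned}$$

and at the same time, 

$$\begin{aligned}
\eta(C,E) &= g\big(p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big) \\
&= g\Big(\sum_{\pm X} p(\pm X \mid C)\, p(E \mid C, \pm X),\; \sum_{\pm X} p(\pm X \mid \neg C)\, p(E \mid \neg C, \pm X)\Big).
\end{aligned}$$

Combining both equations yields 

$$g\big(p(X \mid C),\, p(X \mid \neg C)\big) \cdot g\big(p(E \mid X),\, p(E \mid \neg X)\big) = g\Big(\sum_{\pm X} p(\pm X \mid C)\, p(E \mid C, \pm X),\; \sum_{\pm X} p(\pm X \mid \neg C)\, p(E \mid \neg C, \pm X)\Big).$$

Since X screens off E from C in this network, $p(E \mid \pm C, \pm X) = p(E \mid \pm X)$. 
With the variable settings 

$$\alpha = p(X \mid C), \qquad \bar{\alpha} = p(X \mid \neg C), \qquad \beta = p(E \mid X), \qquad \bar{\beta} = p(E \mid \neg X),$$

equation (6.4) follows immediately. 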
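
As a sanity check on Equation (6.4), the following Python sketch verifies that the difference measure satisfies it for a chain C → X → E, while the probability ratio measure does not. The function names and the parameter values are hypothetical; `alpha_bar` and `beta_bar` stand for p(X|¬C) and p(E|¬X), and conditional and interventional probabilities are assumed to coincide, as in the proof.

```python
from fractions import Fraction as F


def eta_d(p_do, p_do_not):
    """Difference measure: p(E|do(C)) - p(E|do(not-C))."""
    return p_do - p_do_not


def eta_r(p_do, p_do_not):
    """Probability ratio measure: p(E|do(C)) / p(E|do(not-C))."""
    return p_do / p_do_not


def chain_effect(alpha, alpha_bar, beta, beta_bar):
    """p(E|do(C)) and p(E|do(not-C)) in the chain C -> X -> E."""
    p_e_do_c = alpha * beta + (1 - alpha) * beta_bar
    p_e_do_not_c = alpha_bar * beta + (1 - alpha_bar) * beta_bar
    return p_e_do_c, p_e_do_not_c


# Hypothetical parameters of the chain.
alpha, alpha_bar = F(4, 5), F(1, 5)    # p(X|C), p(X|not-C)
beta, beta_bar = F(9, 10), F(3, 10)    # p(E|X), p(E|not-X)

overall = chain_effect(alpha, alpha_bar, beta, beta_bar)

# Left- and right-hand side of Equation (6.4) for each measure.
d_product = eta_d(alpha, alpha_bar) * eta_d(beta, beta_bar)
d_overall = eta_d(*overall)
r_product = eta_r(alpha, alpha_bar) * eta_r(beta, beta_bar)
r_overall = eta_r(*overall)

print(d_product == d_overall)   # multiplicativity holds for eta_d
print(r_product == r_overall)   # and fails for eta_r in this example
```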
Third, we are going to show that 

$$g(x,y) = g(x - y,\, 0). \tag{6.5}$$

To this end, we first note a couple of facts about $g$:³ 

³ In the proof, negative arguments of $g$ figure. This may look problematic, but it is 
not. We just show that any $g(\cdot,\cdot)$ that satisfies Equation (6.4) on $[0,1]^2$ has an 
extension to a function on $\mathbb{R}^2$ that satisfies certain properties, which can in turn 
be used for saying something about the behavior of $g$ on $[0,1]^2$. 
Fact 1 $g(\alpha,0)\, g(\beta,0) = g(\alpha\beta,0)$. This follows immediately from 
equation (6.4) with $\bar{\alpha} = \bar{\beta} = 0$. 

Fact 2 $g(1,0) = 1$. With $\beta = 1$, the previous fact entails that 
$g(\alpha,0)\, g(1,0) = g(\alpha,0)$. Hence, either $g(\alpha,0) = 0$ for all values of 
$\alpha$ (which would trivialize $g$) or $g(1,0) = 1$. 

Fact 3 $g(0,1) = -1$. Equation (6.4) entails (with $\alpha = \beta = 0$, 
$\bar{\alpha} = \bar{\beta} = 1$) that $g(0,1) \cdot g(0,1) = g(1,0) = 1$. Hence, either 
$g(0,1) = -1$ or $g(0,1) = 1$. If the latter were the case, then $g$ would take 
positive values although $p(E \mid \mathrm{do}(C)) = 0$ and 
$p(E \mid \mathrm{do}(\neg C)) > 0$, in violation of Difference-Making. Thus, 
$g(0,1) = -1$. 

Fact 4 $g(-1,0) = -1$. By Fact 1, $g(-1,0) \cdot g(-1,0) = g(1,0) = 1$. Then 
we apply the same reasoning as in the proof of Fact 3. 

Fact 5 $g(0,1) \cdot g(\beta,\bar{\beta}) = g(\bar{\beta},\beta)$. This follows immediately 
from equation (6.4) with $\alpha = 0$, $\bar{\alpha} = 1$. 


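These facts will allow us to derive Equation (6.5). Note that (6.5) is trivial 
if $y = 0$. So we can restrict ourselves to the case that $y > 0$. We choose the 
variable settings 

$$\alpha = \frac{y - x}{y}, \qquad \bar{\alpha} = 0, \qquad \beta = 0, \qquad \bar{\beta} = y.$$

Then we obtain by means of Equation (6.4) and the previously proven facts 

$$\begin{aligned}
g(x,y) &= g\big((y-x)/y,\, 0\big) \cdot g(0,y) \\
&= g(y-x,\, 0) \cdot g(1/y,\, 0) \cdot g(0,y) && \text{(Fact 1)} \\
&= g(y-x,\, 0) \cdot g(1/y,\, 0) \cdot g(y,0) \cdot g(0,1) && \text{(Fact 5)} \\
&= g(y-x,\, 0) \cdot g(1,0) \cdot g(-1,0) && \text{(Facts 1, 3 and 4)} \\
&= g(x-y,\, 0) && \text{(Facts 1 and 2)}
\end{aligned}$$

This implies 

$$\eta(C,E) = g\big(p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big) = g\big(p(E \mid \mathrm{do}(C)) - p(E \mid \mathrm{do}(\neg C)),\, 0\big).$$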


Hence, $\eta(C,E)$ is a function of $p(E \mid \mathrm{do}(C)) - p(E \mid \mathrm{do}(\neg C))$ only. It is easy 
to see that this function must be monotonic, that is, $g$ is monotonically 
increasing in its first argument. Otherwise there would be $x, y \in [0,1]$ 
with $x > y$ and $g(x,0) < g(y,0)$. In that case, application of Equation (6.5) 
and Difference-Making yields 

$$0 > g(x,0) - g(y,0) = g(x-y,\, 0) \geq 0$$

and a contradiction results. This concludes the proof of the theorem. 


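Proof of Theorem 6.3: The proof relies on a move from the proof of 
Theorem 1 in Schupbach and Sprenger (2011). Consider three variables 
$C$, $E_1$ and $E_2$ which satisfy the premises of the No Dilution for Irrelevant 
Effects Principle. This means that 

$$\begin{aligned}
p(E_1 \wedge E_2 \mid \mathrm{do}(C)) &= p(E_1 \mid \mathrm{do}(C)) \cdot p(E_2 \mid \mathrm{do}(C)) \\
p(E_1 \wedge E_2 \mid \mathrm{do}(\neg C)) &= p(E_1 \mid \mathrm{do}(\neg C)) \cdot p(E_2 \mid \mathrm{do}(\neg C)) \\
p(E_2) &= p(E_2 \mid \mathrm{do}(\neg C)) = p(E_2 \mid \mathrm{do}(C))
\end{aligned}$$

In particular it follows that 

$$\begin{aligned}
p(E_1 \wedge E_2 \mid \mathrm{do}(C)) &= p(E_2)\, p(E_1 \mid \mathrm{do}(C)) \\
p(E_1 \wedge E_2 \mid \mathrm{do}(\neg C)) &= p(E_2)\, p(E_1 \mid \mathrm{do}(\neg C))
\end{aligned}$$

According to Formality and Difference-Making, the causal effect measure 
$\eta$ can be written as $\eta(C,E) = g\big(p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big)$ 
for a continuous function $g$. From No Dilution for Irrelevant Effects and the 
above calculations we can infer that 

$$g\big(p(E_1 \mid \mathrm{do}(C)),\, p(E_1 \mid \mathrm{do}(\neg C))\big) = \eta(C,E_1) = \eta(C, E_1 \wedge E_2) = g\big(p(E_2)\, p(E_1 \mid \mathrm{do}(C)),\, p(E_2)\, p(E_1 \mid \mathrm{do}(\neg C))\big).$$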


Since we have made no assumptions on the values of these probabilities, 
we can infer the general relationship 

$$g(x,y) = g(cx,\, cy) \tag{6.6}$$

for all $0 < c < \min(1/x, 1/y)$. Without loss of generality, let $x > y$. Then, 
choose $c := 1/x$. In this case, equation (6.6) becomes 

$$g(x,y) = g(cx,\, cy) = g(1,\, y/x).$$

This implies that $g$ must be a function of $y/x$ only, that is, of the ratio 
p(E|do(C))/p(E|do(¬C)). Difference-Making then implies that all such 
functions must be monotonically increasing in this ratio, concluding the 
proof of the theorem. 
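
The role of Equation (6.6) can be illustrated numerically. In the Python sketch below, an effect E1 is conjoined with an effect E2 that is irrelevant to C. The probability ratio measure is unaffected by adding E2, whereas the difference measure is diluted by the factor p(E2). The function names and the probability values are hypothetical.

```python
from fractions import Fraction as F


def eta_r(p_do, p_do_not):
    """Probability ratio measure."""
    return p_do / p_do_not


def eta_d(p_do, p_do_not):
    """Difference measure."""
    return p_do - p_do_not


# Hypothetical values: x = p(E1|do(C)), y = p(E1|do(not-C)),
# c = p(E2), where E2 is independent of C and of E1 given C.
x, y, c = F(3, 5), F(1, 5), F(1, 2)

# Probabilities of the conjunction E1 & E2 under do(C) and do(not-C).
x_conj, y_conj = c * x, c * y

print(eta_r(x, y), eta_r(x_conj, y_conj))   # same value: g(x, y) = g(cx, cy)
print(eta_d(x, y), eta_d(x_conj, y_conj))   # second value is smaller
```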


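Proof of Theorem 6.4: We begin by showing that $\eta_g$ and $\eta_c$ are 
ordinally equivalent. For positive causation, 

$$\eta_c(C,E) = \frac{-p(\neg E \mid \mathrm{do}(C)) + p(\neg E \mid \mathrm{do}(\neg C))}{p(\neg E \mid \mathrm{do}(\neg C))} = 1 - \frac{1}{\eta_g(C,E)}$$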


and for causal prevention, 


_ plBldo(C)) — p(Eldo(-C)) _, 1 
TE HEldeaC)) aE) 


After these preliminaries, we start with the real proof. The causal effect 
measure $\eta$ that satisfies Formality, Difference-Making, No Dilution for 
Irrelevant Effects (Prevention) and WCPS can be represented by a function 
$g(x,y)$ with $x = p(E \mid \mathrm{do}(C))$ and $y = p(E \mid \mathrm{do}(\neg C))$. Suppose that 
there are $x > y$ and $x' > y' \in [0,1]$ such that $(1-x)/(1-y) = (1-x')/(1-y')$, 
but $g(x,y) \neq g(x',y')$. (Otherwise $\eta$ would just be ordinally equivalent 
to $\eta_g$ and $\eta_c$.) In that case we can find a probability space such that 
$p(E_1 \mid \mathrm{do}(C)) = x$, $p(E_1 \mid \mathrm{do}(\neg C)) = y$, 
$p(E_2 \mid \mathrm{do}(C)) = x'$, $p(E_2 \mid \mathrm{do}(\neg C)) = y'$ 
and $C$ screens off $E_1$ and $E_2$ (proof omitted, but straightforward). Hence 
$\eta(C,E_1) \neq \eta(C,E_2)$. By Weak Causation-Prevention Symmetry, we can 
then infer $\eta(C, \neg E_1) \neq \eta(C, \neg E_2)$. 

But this leads to a straightforward contradiction. After all, for cases 
of causal prevention, we can apply the previous representation theorem 
relating to a conjunction of Formality, Difference-Making and No Dilution 
for Irrelevant Effects. This implies that for cases of causal prevention, 

$$\eta(C,E) = f\left(\frac{p(E \mid \mathrm{do}(C))}{p(E \mid \mathrm{do}(\neg C))}\right)$$

for some monotonically increasing function $f$. Hence, 

$$\eta(C, \neg E_1) = f\left(\frac{p(\neg E_1 \mid \mathrm{do}(C))}{p(\neg E_1 \mid \mathrm{do}(\neg C))}\right) = f\left(\frac{1-x}{1-y}\right)$$

and analogously, 

$$\eta(C, \neg E_2) = f\left(\frac{p(\neg E_2 \mid \mathrm{do}(C))}{p(\neg E_2 \mid \mathrm{do}(\neg C))}\right) = f\left(\frac{1-x'}{1-y'}\right).$$

Thus, it follows from $\eta(C, \neg E_1) \neq \eta(C, \neg E_2)$ that 

$$f\left(\frac{1-x}{1-y}\right) \neq f\left(\frac{1-x'}{1-y'}\right),$$

in contradiction with our assumption that $(1-x)/(1-y) = (1-x')/(1-y')$. 
Hence, there is a function $h$ such that $g(x,y) = h\big((1-x)/(1-y)\big)$, 
and a function $h'(t) := h(1/t)$ such that $g(x,y) = h'\big((1-y)/(1-x)\big)$. By 
Difference-Making, this function must be monotonically increasing. This 
shows that any causal effect measure that satisfies our four conditions 
must be ordinally equivalent to $\eta_g$ and hence also to $\eta_c$. 


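Proof of Theorem 6.5: By Formality and Difference-Making, we have 
that $\eta(C,E) = g\big(p(E \mid \mathrm{do}(C)),\, p(E \mid \mathrm{do}(\neg C))\big)$ for some 
continuous function $g : [0,1]^2 \to \mathbb{R}$. Assume now that 
$\eta(C,E_1) = \eta(C,E_2) = t$, that $C$ screens off $E_1$ and $E_2$ and that 
$p(E_1 \mid \mathrm{do}(C)) = p(E_2 \mid \mathrm{do}(C)) = x$, 
$p(E_1 \mid \mathrm{do}(\neg C)) = p(E_2 \mid \mathrm{do}(\neg C)) = y$, for some $x, y \in \mathbb{R}$. 
By the Conjunctive Closure Principle, we can infer 

$$\eta(C, E_1 \wedge E_2) = \eta(C, E_1) = g(x,y).$$

Moreover, we can infer 

$$\begin{aligned}
\eta(C, E_1 \wedge E_2) &= g\big(p(E_1 \wedge E_2 \mid \mathrm{do}(C)),\, p(E_1 \wedge E_2 \mid \mathrm{do}(\neg C))\big) \\
&= g\big(p(E_1 \mid \mathrm{do}(C)) \cdot p(E_2 \mid \mathrm{do}(C)),\; p(E_1 \mid \mathrm{do}(\neg C)) \cdot p(E_2 \mid \mathrm{do}(\neg C))\big) \\
&= g(x^2, y^2).
\end{aligned}$$

Taking both calculations together, we obtain 

$$g(x^2, y^2) = g(x,y) \tag{6.7}$$

as a structural requirement on the function $g$, since we have not made any 
assumptions on $x$ and $y$. 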
Following Atkinson (2012) and his proof idea, we now define 

$$u = \frac{\log x}{\log y}$$

and define a function $f : \mathbb{R}^2 \to \mathbb{R}$ such that $f(x,u) := g(x,y)$. Equation 
(6.7) then implies the requirement 

$$f(x^2, u) = g(x^2, y^2) = g(x,y) = f(x,u)$$

and by iterating the same procedure, we also obtain 

$$f(x^{2^n}, u) = f(x,u)$$

for any $n \in \mathbb{N}$. Due to the continuity of $f$, we can infer that $f$ cannot 
depend on its first argument. Moreover, taking the limit $n \to \infty$ yields 
$f(x,u) = f(0,u)$. Hence, also 

$$g(x,y) = f(0,u) = f(0,\, \log x / \log y)$$

and we see that 

$$\eta(C,E) = h\left(\frac{\log p(E \mid \mathrm{do}(C))}{\log p(E \mid \mathrm{do}(\neg C))}\right)$$

for some function $h : \mathbb{R} \to \mathbb{R}$. It remains to show that $h$ is monotonically 
increasing. Difference-Making implies that $\eta(C,E)$ is an increasing 
function of p(E|do(C)) and a decreasing function of p(E|do(¬C)). So it must 
be an increasing function of log p(E|do(C)) / log p(E|do(¬C)), too. This 
implies that $h$ is a monotonically increasing function and concludes the 
proof that all measures of causal effect that satisfy Formality, 
Difference-Making and the Conjunctive Closure Principle are ordinally equivalent to 

$$\eta_{cc}(C,E) = \frac{\log p(E \mid \mathrm{do}(C))}{\log p(E \mid \mathrm{do}(\neg C))}$$
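
The structural requirement (6.7) is easy to check numerically for the Logarithmic Ratio measure. The Python sketch below compares the value of the measure for a single effect with its value for the conjunction of two effects that have the same interventional probabilities and are screened off by C. For contrast it also evaluates the difference measure, which does not satisfy the requirement. The function names and the probability values are hypothetical, and floating-point comparisons are made up to rounding.

```python
import math


def log_ratio(p_do, p_do_not):
    """Logarithmic Ratio measure: log p(E|do(C)) / log p(E|do(not-C))."""
    return math.log(p_do) / math.log(p_do_not)


def difference(p_do, p_do_not):
    """Difference measure, for comparison."""
    return p_do - p_do_not


# Hypothetical values: x = p(E_i|do(C)), y = p(E_i|do(not-C)) for i = 1, 2.
x, y = 0.5, 0.25

single = log_ratio(x, y)
# C screens off E1 and E2, so the conjunction has probabilities x*x and y*y.
conjunction = log_ratio(x * x, y * y)

print(single, conjunction)                       # equal up to rounding
print(difference(x, y), difference(x * x, y * y))  # these differ
```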


Variation 7: Explanatory Power 


Explanation is a central element of scientific reasoning. Scientists from 
cognitive science, artificial intelligence and computer science avidly study 
abductive inference, that is, inference where the explanatory value of a 
hypothesis for a set of phenomena obtains special status (e.g., Hobbs et al., 
1988; Bylander et al., 1991; Eiter and Gottlob, 1995; Magnani, 2001; Dou- 
ven, 2011). Statisticians often give maximum likelihood estimates of an 
unknown parameter. In other words, they endorse the parameter value 
that provides the best explanation of the data (e.g., Edwards, 1972; Roy- 
all, 1997). Finally, the concept of explanation is also salient in cognitive 
psychology: explanation-based reasoning affects the way people learn cat- 
egories, generalize properties and draw inferences (Rips, 1989; Thagard, 
1989; Lombrozo, 2006, 2009, 2012). 

Explanation is closely related to other important concepts in scientific 
reasoning, such as prediction, causation, and unification. Explanations dif- 
fer by discipline and context: phenomena are deduced from natural laws, 
unified by new and general theories, produced by causal mechanisms, or 
predicted by statistical models (e.g., Hempel, 1965; Salmon, 1971, 1984; 
van Fraassen, 1980; Machamer et al., 2000; Lipton, 2004; Woodward, 2014). 
There is also a general tension between accounts of explanation that em- 
phasize the predictive value of an explanation for a phenomenon (e.g., 
Hempel and Oppenheim, 1948) and those that stress that explanations 
provide genuine understanding (e.g., de Regt and Dieks, 2005). Given 
this wide variety of explanatory reasoning, it is not easy to give a con- 
vincing analysis of scientific explanation that transcends the scope of a 
particular context and unifies reasoning in different disciplines. A plu- 
rality of accounts of scientific explanation may be more realistic than a 
single account that purports to capture all aspects of scientific explanation 
(Colombo, 2016; Colombo and Wright, 2016). 


Consequently, the place of Bayesian reasoning in a theory of 
explanation will be different from its place in the case of confirmation, which is 
essentially captured by a probabilistic explication. Rather, our method will 
mimic the previous variation on causal effect. There, we described the 
quantitative dimension of causal judgments (the size of a causal effect) by means 
of a Bayesian formalism without committing ourselves to a particular qualitative 
theory of causation. Similarly, in this variation, we focus on a Bayesian 
explication of explanatory power, that is, the degree to which a 
hypothesis explains a phenomenon, without trying to give a complete Bayesian 
account of scientific explanation. In particular, we will not make any 
attempt to reduce explanation to probabilistic relationships. Rather, we 
show how explanatory power can be explicated within a broadly Bayesian 
approach to scientific reasoning, and how such an explication can 
fruitfully inspire further research on scientific explanation. 


There is another reason why a full Bayesian theory of explanation 
may be difficult to achieve, namely an intrinsic tension between Bayesian 
and explanatory inference. Philosophers such as Gilbert Harman (1965) 
and Peter Lipton (2004) have defended Inference to the Best 
Explanation (IBE) as a rational mode of inference: a hypothesis is inferred on the 
basis of its explanatory virtues. For example, evolutionary psychologists 
explain features of human behavior and cognition, such as differences in 
mating or reasoning patterns between males and females, by 
environmental adaptations evolved during the Pleistocene (e.g., Buss and Schmitt, 
1993). Specific theories, such as Parental Investment Theory or Sexual 
Selection Theory (Buss, 1998; Miller, 1998, 2000), are inferred on the basis of 
their ability to explain such differences by means of evolutionary stories. 


However, if we have no further indication that the theory in question is 
empirically adequate, inferring theories on the basis of their explanatory 
value may just lead us to just-so stories or improbable conclusions. 
Certainly we should not infer an implausible story about human life in the 
Pleistocene merely on the basis of its ability to explain features of current 
behavior (e.g., Gould and Lewontin, 1979). More generally, what are the 
circumstances where explanatory and Bayesian inferences agree? This question 
is still open, as witnessed by a lively debate about whether IBE is compatible 
with Bayesian inference and can be framed in Bayesian terms (van Fraassen, 
1989; Okasha, 2000; Lipton, 2001; Salmon, 2001; Schupbach, 2011b, 2016). For 
this reason, a reduction of explanatory to Bayesian reasoning is at least 
difficult to achieve. 
This variation is structured as follows. First, we motivate why explana- 
tory power should be captured by a statistical relevance measure. In this 
context, we also compare two different approaches to explanatory power: 
those motivated by probabilistic theories of causality and those motivated 
by a structural analogy between prediction and explanation (Section 7.1). 
Second, we compare different statistical relevance explications of explana- 
tory power and develop arguments for a particular measure (Section 7.2). 
Finally, we sketch projects for future research on the integration of proba- 
bilistic and explanatory inference (Section 7.3). 


Toward a Statistical Relevance Account of Explanatory 
Power 


The first characterization of explanatory, abductive inference in terms of 
probabilistic reasoning can be found in the writings of the American prag- 
matist philosopher C.S. Peirce: 


Long before I first classed abduction as an inference it was 
recognized by logicians that the operation of adopting an ex- 
planatory hypothesis—which is just what abduction is—was 
subject to certain conditions. Namely, the hypothesis cannot 
be admitted, even as a hypothesis, unless it be supposed that 
it would account for the facts or some of them. The form of 
inference, therefore, is this: 


• The surprising fact, E, is observed; 
• But if H were true, E would be a matter of course; 
• Hence, there is reason to suspect that H is true. (Peirce, 
1931) 


Peirce’s characterization contains two crucial premises: First, the 
phenomenon E is surprising, or expressed in probabilistic terms: p(E) is small. 
Second, given H, E is “a matter of course”, that is, p(E|H) is close to unity. 
If these premises are satisfied, Peirce concludes that “there is reason to 
suspect that H is true”: not necessarily a conclusive reason, but at least 
some reason to accept H. In other words, it is crucial to explanatory 
inference that the surprising fact E is rationalized by H. This feature of good 
explanations was also stressed a couple of decades later by Carl G. Hempel: 
[T]he [explanatory] argument shows that, given the particular 
circumstances and the laws in question, the occurrence of the 
phenomenon was to be expected; and it is in this sense that the 
explanation enables us to understand why the phenomenon oc- 
curred. (Hempel, 1965, 337, original emphasis) 


And, one page later: 


the explanatory information must provide good grounds for 
believing that X [the explanandum] did in fact occur; other- 
wise, that information would give us no adequate reason for 
saying: “That explains it — that does show why X occurred.” 
(Hempel, 1965, 368) 


Explanation thus has a central epistemic function—namely to resolve the 
epistemic puzzle surrounding the explanandum, to make it a matter of 
course given the explanans. For a recent defense of the predictive value of 
explanations, see Douglas (2009a). 

There is, however, a subtle difference between Peirce and Hempel. 
Peirce explicitly stresses that E must have been surprising beforehand, 
whereas Hempel does not, at least not explicitly. This difference corresponds to the 
choice between two different types of probabilistic explications of explana- 
tory power: one that focuses on the statistical relevance of H for E (e.g., by 
comparing p(E|H) and p(E)), and another that focuses solely on the de- 
gree to which E is expected given H. This is, by the way, the same choice 
that we have already faced in Variations 2 and 6. There, we distinguished 
confirmation as firmness from confirmation as increase in firmness, and 
measures of causal production from measures of counterfactual depen- 
dence. 
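
The difference between the two types of explication can be illustrated with a small numerical sketch. The Python snippet below contrasts the expectedness of E given H, that is, p(E|H), with one simple way of comparing p(E|H) and p(E), namely their difference. The function names, the use of the plain difference as a relevance score and the probability values are assumptions made for illustration only; they are not the measures defended later in this variation.

```python
from fractions import Fraction


def expectedness(p_e_given_h):
    """Second type of explication: only p(E|H) matters."""
    return p_e_given_h


def relevance(p_e_given_h, p_e):
    """First type of explication (illustrative form): p(E|H) - p(E)."""
    return p_e_given_h - p_e


# Two hypothetical cases with the same p(E|H) but different prior p(E).
surprising = {"p_e": Fraction(1, 20), "p_e_given_h": Fraction(9, 10)}
unsurprising = {"p_e": Fraction(9, 10), "p_e_given_h": Fraction(9, 10)}

for case in (surprising, unsurprising):
    print(expectedness(case["p_e_given_h"]),
          relevance(case["p_e_given_h"], case["p_e"]))
```

On the expectedness reading both cases count as equally good explanations, whereas the relevance reading rewards H only where E was surprising beforehand, in line with Peirce's first premise.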

Joseph Halpern and Judea Pearl, two renowned researchers on causal 
inference, have proposed an explication of explanatory power that pursues 
the second option (Halpern and Pearl, 2005a,b). They observe that causa- 
tion and explanation are a hard-to-separate couple in scientific reasoning. 
Indeed, the most natural and intuitive account of explanation is an account 
where phenomena are explained by their causes, e.g., certain mechanisms: 
causes give an account of how and why the explanandum was produced. 
Failure of the brakes explains a car accident. Poison in the food explains 
the death of the king. Exposure to violent video games explains aggres- 
sive behavior. In all of these examples, causal efficacy grounds explanatory 
power. So perhaps it is not surprising that until the early 20th century, the 
concept of explanation was subordinate to the concept of causation, e.g., 
in the writings of David Hume or Immanuel Kant. 


In line with these intuitions, the causal theory of explanation sees the 
role of explanations in tracing causal processes and interactions leading to 
the explanandum (e.g., Salmon, 1984; Dowe, 2000; Strevens, 2009). More 
precisely, “the role of explanation is to provide the information needed to 
establish causation” (Halpern and Pearl, 2005b, 897). Halpern and Pearl 
use the interventionist account of causality, presented in the previous vari- 
ation, to redefine the notion of explanatory power: “we view an explana- 
tion as a fact that is not known for certain but, if found to be true, would 
constitute a genuine cause of the explanandum, regardless of the agent’s 
initial uncertainty” (ibid.). Thereby, Halpern and Pearl relativize an ex- 
planation to the epistemic state of an agent and introduce a pragmatic, 
subject-dependent component. They then stipulate that the value of a 
certain variable C = x counts as an explanation of some fact E roughly if and 
only if (i) E holds true in all contexts that the agent regards as possible; 
(ii) C = x is a cause of E; (iii) there are possible contexts where the 
explanation is false. The last clause serves to rule out vacuous explanations. 
The goodness of such an explanation is then quantified by the probability 
p(E|do(C = x)), that is, the probability of E given an intervention that sets 
C to the value x. Numerically, the Halpern-Pearl measure is identical to Pearl’s 
measure of causal effect $\eta_{sp}$ that we reviewed in Variation 6. 


On the other hand, in the same sense that Halpern and Pearl’s account 
allows for probability-lowering (actual) causes, it allows for probability- 
lowering explanations. This feature is very unintuitive—explanations 
should, as noted by Peirce, make a positive difference to the phenomenon 
they are trying to explain. In particular, the explained phenomenon should 
not be less likely under the explanans than under the alternative hypothe- 
ses. For (actual) causation, this is not necessarily problematic: if a football 
player hits the ball badly and the ball ends up in the goal nonetheless (e.g., 
because the goalie was unprepared), his shot has still caused the goal, even 
if it lowered the probability of a goal compared to a proper shot. But we 
would be hesitant to say that the player’s bad shot explains the goal. On 
this view of explanations, which we will adopt in the remainder, explana- 
tions are (probabilistic) arguments in favor of the explanandum; they are 
prima facie reasons to accept the explanans. 

Moreover, the Halpern-Pearl account regards explanation as being secondary
to causation. As a consequence, some pragmatic aspects of explanation
drop out of sight. Bas van Fraassen (1980) illustrates this point
with the famous flagpole story from Bromberger (1965). We can explain 
the length of the shadow of a flagpole by the height of the flagpole, con- 
ditional on the angle of the sun. Prima facie, the reverse explanation does 
not work: the secondary phenomenon, the length of the shadow, does not 
explain the primary phenomenon, the height of the flagpole. But such 
judgments depend on pragmatic factors, as van Fraassen pointed out: in 
a specific context, the height of the flagpole could be explained by the fact 
that it was manufactured to cast a shadow of a certain length at a certain 
time of the day. Sundials work in this way, for example. Depending on the 
context, both explanations can be acceptable, although only one of them is 
properly causal—the other one is functional. 


There is no apparent reason why one of these two explanations should 
be preferred across the board. After all, there is a great plurality of ex- 
planatory reasoning in science, with modes of explanation as different as 
mathematical explanation (Colyvan, 2001), functional and mechanistic ex- 
planation (Machamer et al., 2000; Craver, 2007) and unification (Friedman, 
1974; Kitcher, 1981). It is therefore natural to conceive of explanation as 
a cognitive phenomenon rather than something that is instantiated in the 
real world (e.g., by means of a causal relationship). This observation sup- 
ports the plausibility of the approach to conceptualize scientific explana- 
tions as arguments. Then, the power of an explanation may be measured 
by the degree to which the explanans rationalizes the explanandum. 


In the following section, we ask how to quantify the explanatory
power of H with respect to E. We are only concerned with quantifying
the strength of that explanation and presuppose that H qualifies as
an acceptable explanation of E. This is, of course, no reductive analysis of 
explanation, but it allows us to focus on the “grammar” of explanatory 
power without committing ourselves to a specific (and possibly problem- 
atic) view on the nature of explanation. At the same time, the question 
of explicating the concept of explanatory power is complex enough that 
even in the simple case that H is an undisputed explanation of E, there is 
ample room for disagreement about the degree of explanatory power.


Explicating Explanatory Power 


Our basic idea, faithful to the principles of Bayesian philosophy of science, 
is to explicate explanatory power as a function of the joint probability dis- 
tribution of E and H. This is actually in line with Peirce’s rationale that 
explanation proceeds by making a “surprising fact” a “matter of course”. 
Potentially, this can be extended to a causal-explanatory calculus where, 
instead of conditional probabilities such as p(E|H), we reason with coun- 
terfactual probabilities such as p(E|do(H)) and p(E|do(-H)), that is, the 
probability of E given a causal intervention on the putative explanation. 
We leave this question open since it does not directly affect our discussion 
of the various measures of explanatory power. 

As in previous variations, we assume that E and H are among the
closed sentences $\mathcal{L}$ of a first-order language $L$. Analogous to a confirmation
measure, a measure of explanatory power is described by a function
$\mathcal{E}: \mathcal{L}^2 \times \mathfrak{P} \to \mathbb{R}$, where $\mathfrak{P}$ is the set of probability measures on the
$\sigma$-algebra generated by $\mathcal{L}$. This function assigns a real-valued degree of
explanatory power $\mathcal{E}(E,H)$ to any pair of sentences in $\mathcal{L}$, together with
a probability measure p. For the sake of simplicity, we will omit reference
to background assumptions and assume that they are implicit in the
probability function p.

Three measures of explanatory power have been advanced and discussed
in recent years. We shall now present them together with the corresponding
representation theorems before delving into the issue of comparing
the three. We omit Popper’s (2002) measure
$\mathcal{E}(E,H) = (p(E|H) - p(E))/(p(E|H) + p(E))$ since he provides no independent motivation, and
the phrase “explanatory power” is used in a heuristic sense only, in the 
context of explicating a measure of degree of corroboration. 

Among the remaining candidates, the oldest measure in the debate is 
the one proposed by I.J. Good (1960) and Timothy McGrew (2003): 


$$\mathcal{E}_{GMG}(E,H) = \log \frac{p(E|H)}{p(E)} \qquad (7.1)$$


The Good-McGrew measure allows for an axiomatic representation, given
by Cohen (2016b). To this end, we need to define a number of conditions:


Formality There is a function g such that, for any $E, H \in \mathcal{L}$ and any $p \in \mathfrak{P}$,
$\mathcal{E}(E,H) = g(p(E \wedge H), p(E), p(H))$.


Formality captures the idea that $\mathcal{E}$ is a function of the joint probability
distribution of E and H. The same idea has been applied successfully to
measures of confirmation in Variation 2. Next, we have a statistical relevance
condition which rules out probability-lowering explanations:


Statistical Relevance For any $E, H_1, H_2 \in \mathcal{L}$ and any $p \in \mathfrak{P}$,
$\mathcal{E}(E,H_1) > \mathcal{E}(E,H_2)$ if and only if $p(E|H_1) > p(E|H_2)$.


This condition states, in other words, that among two competing expla- 
nations for the same phenomenon, we should prefer the one which ratio- 
nalizes the explanandum to a higher degree. Note that this is not entirely 
uncontroversial since according to several philosophers of science working 
on explanation (e.g., Okasha, 2000; Lipton, 2004; Schupbach, 2016), degree 
of explanatory power may depend on the goodness of an explanation and 
its plausibility. According to Statistical Relevance, however, even a very 
unlikely explanation is preferred to a likely one, as long as it is better at 
explaining the data. 

The following condition is familiar from the explication of degree of 
confirmation in Variation 2. It forges a link between explanatory power 
and degree of confirmation: H explains $E_1$ better than $E_2$ if and only if $E_1$
raises the probability of H to a higher level than $E_2$ does.


Final Probability Incrementality For any $E_1, E_2, H \in \mathcal{L}$ and any $p \in \mathfrak{P}$,
$\mathcal{E}(E_1,H) > \mathcal{E}(E_2,H)$ if and only if $p(H|E_1) > p(H|E_2)$.


According to this condition, explanatory power is structurally similar to 
degree of confirmation in so far as a candidate explanans H performs best 
on those phenomena that are also statistically relevant for it. From these 
assumptions, Cohen (2016b) derives the following representation theorem: 


Theorem 7.1 Formality, Statistical Relevance, and Final Probability Incrementality
hold for a measure of explanatory power $\mathcal{E}(E,H)$ if and only if there is a
strictly increasing function $f: \mathbb{R} \to \mathbb{R}$ such that for any $E, H \in \mathcal{L}$ and any
$p \in \mathfrak{P}$, $\mathcal{E}(E,H) = f(\mathcal{E}_{GMG}(E,H))$.


In other words, the above conditions characterize $\mathcal{E}_{GMG}$ uniquely, up to
ordinal equivalence. Cohen’s representation theorem transposes the result 
by Crupi et al. (2013), cited in Variation 2, from degree of confirmation to 
explanatory power. In the original paper, the same conditions are imposed 
in order to derive r(H,E) = p(H|E)/p(H) = p(E|H)/p(E) as a measure 
of degree of confirmation. 
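
As a quick check, and only as a sketch in the notation of this section, the following lines spell out why the Good-McGrew measure is ordinally equivalent to the ratio measure r and why it satisfies Statistical Relevance and Final Probability Incrementality. The derivation assumes that p(E) and p(H) are strictly positive; it is not a reconstruction of Cohen's proof.

```latex
% Bayes' theorem links the two ratios (assuming p(E) > 0 and p(H) > 0):
\[
  \frac{p(H|E)}{p(H)} = \frac{p(E|H)\,p(H)}{p(E)\,p(H)} = \frac{p(E|H)}{p(E)},
  \qquad\text{hence}\qquad
  \mathcal{E}_{GMG}(E,H) = \log r(H,E).
\]
% Statistical Relevance: fix E and compare H_1 with H_2.
\[
  \log\frac{p(E|H_1)}{p(E)} > \log\frac{p(E|H_2)}{p(E)}
  \iff p(E|H_1) > p(E|H_2).
\]
% Final Probability Incrementality: fix H and compare E_1 with E_2.
\[
  \log\frac{p(H|E_1)}{p(H)} > \log\frac{p(H|E_2)}{p(H)}
  \iff p(H|E_1) > p(H|E_2).
\]
```

Both equivalences use nothing more than the fact that the logarithm is strictly increasing and that the fixed term in the denominator is positive.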

A similar representation result can be derived for a measure proposed
by Crupi and Tentori (2012). It takes the form


$$\mathcal{E}_{CT}(E,H) = \begin{cases} \dfrac{p(E|H) - p(E)}{1 - p(E)} & \text{if } p(E|H) \geq p(E) \\[2ex] \dfrac{p(E|H) - p(E)}{p(E)} & \text{if } p(E|H) < p(E) \end{cases} \qquad (7.2)$$


For the $\mathcal{E}_{CT}$-measure, the following condition is characteristic:


Explanatory Justice If $E'$ is statistically independent from E, H, and their
conjunction $E \wedge H$, then:


i) if $\mathcal{E}(E,H) > 0$, then $\mathcal{E}(E \wedge E',H) < \mathcal{E}(E,H)$; and
ii) if $\mathcal{E}(E,H) < 0$, then $\mathcal{E}(E \wedge E',H) = \mathcal{E}(E,H)$.


This condition is substantial and shall be the subject of debate later on. The 
first clause of Explanatory Justice is taken from Schupbach and Sprenger 
(2011, 115, notation changed), who motivate it as follows: 


“[A]s the evidence becomes less statistically relevant to some
explanatory hypothesis H (with the addition of irrelevant
propositions), it ought to be the case that the explanatory
power of H relative to that evidence approaches the value at
which it is judged to be explanatorily irrelevant to the evidence
($\mathcal{E} = 0$).”


Schupbach and Sprenger transfer this property to the case of negative sta- 
tistical dependence: addition of statistically independent evidence dilutes 
(negative) explanatory power and brings it closer to the neutral value of 
zero. Crupi and Tentori, on the other hand, think that this property would 
allow “to indefinitely relieve a lack of explanatory power, no matter how 
large, by adding more and more irrelevant explananda, simply at will”
(Crupi and Tentori, 2012, 370). Hence their demand for the second clause
of Explanatory Justice. See also the discussion of No Dilution for Irrele- 
vant Effects in Variation 6. 

The Crupi-Tentori measure $\mathcal{E}_{CT}$ also satisfies a similar constraint regarding
the relation between positive and negative explanatory power:


Symmetry For any $E_1, E_2, H \in \mathcal{L}$ and any $p \in \mathfrak{P}$, $\mathcal{E}(E_1,H) > \mathcal{E}(E_2,H)$ if
and only if $\mathcal{E}(\neg E_1,H) < \mathcal{E}(\neg E_2,H)$.


That is, if H explains $E_1$ better than $E_2$, then it also explains $\neg E_2$ better
than $\neg E_1$. When Explanatory Justice and Symmetry replace Final Probability
Incrementality in Theorem 7.1, this suffices for demonstrating another
representation result (Crupi and Tentori, 2012; Cohen, 2016b):


Theorem 7.2 Formality, Statistical Relevance, Explanatory Justice and Symmetry
hold for a measure of explanatory power $\mathcal{E}(E,H)$ if and only if there is a
strictly increasing function $f: \mathbb{R} \to \mathbb{R}$ such that for any $E, H \in \mathcal{L}$ and any
$p \in \mathfrak{P}$, $\mathcal{E}(E,H) = f(\mathcal{E}_{CT}(E,H))$.

The primacy of surprise-lowering over the acceptability of the explana- 
tion, as coded in Statistical Relevance, is an important characteristic feature 
of both the Good-McGrew and the Crupi-Tentori measure. It is not shared 
by the third measure in the debate, proposed by Schupbach and Sprenger 
(2011). Their measure has the form 


$$\mathcal{E}_{SS}(E,H) = \frac{p(H|E) - p(H|\neg E)}{p(H|E) + p(H|\neg E)} \qquad (7.3)$$


This measure can be derived in different ways. Schupbach and Sprenger’s 
original derivation is based on four conditions. The first one is a variation 
of the Formality condition, which describes their measure as a function of 
the conditional probabilities $p(H|E)$, $p(H|\neg E)$, and $p(E)$:


Formality* There is a function g such that, for any $E, H \in \mathcal{L}$ and any
$p \in \mathfrak{P}$, $\mathcal{E}(E,H) = g(p(E), p(H|E), p(H|\neg E))$.


The second one is a weakened version of Statistical Relevance: 


Statistical Relevance* The function $g(p(E), p(H|E), p(H|\neg E))$ from Formality*
is not constant in the two latter arguments. That is, there is
no function $h: [0,1] \to \mathbb{R}$ such that $g(x,y,z) = h(x)$.


The intuitive idea behind this condition is that explanatory power should 
be sensitive to probabilistic relations between E and H and not be a func- 
tion of the unconditional probability of the explanandum only. Next, we 
have 


Irrelevant Conjunctions If $H_2$ is statistically independent of E, $H_1$ and
$E \wedge H_1$, then $\mathcal{E}(E,H_1) = \mathcal{E}(E,H_1 \wedge H_2)$.


This condition is similar to the Modularity constraint for measures of con- 
firmation (p. 71). When a scientific hypothesis is irrelevant to a certain 
explanandum and a putative explanans, adding it to the explanans neither
increases nor decreases the degree of explanatory power. Unified theories 
just seem to explain a phenomenon as well as that part of the theory that 
did the explanatory work. Or in other words, embedding an explanation 
into a general framework leaves its explanatory power for a phenomenon 
in its original domain unchanged. 


Finally, there is the condition 


Deductive Entailment If $\neg H$ entails E, then $\mathcal{E}(E,H)$ is not sensitive to the
values of p(H), ceteris paribus.


While intuitions may not be strong enough to make this condition a com- 
pelling constraint on all measures of explanatory power, it does not seem 
to be implausible or harmful either. One might argue that if $\neg H$ is already
a perfect explanation of the explanandum, then the (negative) explanatory 
power of H has nothing to do with its prior probability, but just with the 
degree to which H accounts for E. 

Based on these conditions, Schupbach and Sprenger (2011, Theorem 1) 
prove the following representation theorem: 


Theorem 7.3 Formality*, Statistical Relevance*, Irrelevant Conjunctions and
Deductive Entailment hold for a measure of explanatory power $\mathcal{E}(E,H)$ if and
only if there is a strictly increasing function $f: \mathbb{R} \to \mathbb{R}$ such that for any
$E, H \in \mathcal{L}$ and any $p \in \mathfrak{P}$, $\mathcal{E}(E,H) = f(\mathcal{E}_{SS}(E,H))$.


It should be noted that $\mathcal{E}_{SS}$ also satisfies the Symmetry and Statistical
Relevance conditions (proof omitted), but not Final Probability Incremen- 
tality and only the first, uncontroversial clause of the Explanatory Justice 
condition. The second clause of Explanatory Justice is a major point of 
contention in the debate about different measures of explanatory power, 
as evidenced by Crupi and Tentori (2012) and Cohen (2015, 2016a,b). These
papers also offer different and somewhat simpler representation theorems
for $\mathcal{E}_{SS}$. However, the price they pay is that the assumptions have to be
strengthened a bit. The most interesting alternative characterization (up 
to ordinal equivalence) is due to Cohen (2015) and consists of three condi- 
tions: (i) all tautological hypotheses receive constant explanatory power; 
(ii) the strong symmetry condition $\mathcal{E}(\neg E,H) = -\mathcal{E}(E,H)$; (iii) a somewhat
stronger version of Deductive Entailment. 
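
To make the three definitions (7.1) to (7.3) easier to compare, here is a minimal Python sketch that computes all of them from the same inputs. The function name `explanatory_power`, its parameters and the numbers are hypothetical choices for illustration; the sketch assumes 0 < p(H) < 1, p(E|H) > 0 and 0 < p(E) < 1, so that every ratio and the logarithm are defined.

```python
import math

def explanatory_power(p_e_given_h, p_e_given_not_h, p_h):
    """Return the Good-McGrew, Crupi-Tentori and Schupbach-Sprenger values.

    Assumes 0 < p_h < 1, p_e_given_h > 0 and 0 < p(E) < 1.
    """
    # Law of total probability gives p(E).
    p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
    # (7.1): log ratio of p(E|H) to p(E).
    gmg = math.log(p_e_given_h / p_e)
    # (7.2): the difference p(E|H) - p(E), normalized differently by sign.
    diff = p_e_given_h - p_e
    ct = diff / (1 - p_e) if diff >= 0 else diff / p_e
    # (7.3): posteriors of H given E and given not-E, via Bayes' theorem.
    p_h_given_e = p_h * p_e_given_h / p_e
    p_h_given_not_e = p_h * (1 - p_e_given_h) / (1 - p_e)
    ss = (p_h_given_e - p_h_given_not_e) / (p_h_given_e + p_h_given_not_e)
    return gmg, ct, ss

# Hypothetical numbers: p(E|H) = 0.8, p(E|not-H) = 0.2, p(H) = 0.5.
gmg, ct, ss = explanatory_power(0.8, 0.2, 0.5)
# The same hypothesis evaluated against not-E instead of E.
_, _, ss_neg = explanatory_power(1 - 0.8, 1 - 0.2, 0.5)
print(gmg, ct, ss, ss_neg)
```

In this example the Schupbach-Sprenger value for not-E is the negative of the value for E, which is the strong symmetry condition (ii) mentioned above.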

We now proceed to a normative comparison of the three measures 
and begin with a critique of the Good-McGrew measure $\mathcal{E}_{GMG}$. For
starters, we note that it allows for the conjunction of irrelevant evidence
(Schupbach and Sprenger, 2011, 114-115). Suppose that for some piece
of evidence $E'$, $p(E'|E,H) = p(E'|E)$. In that case,
$\mathcal{E}_{GMG}(E \wedge E',H) = \mathcal{E}_{GMG}(E,H)$, because the common factor cancels in the
ratio $p(E \wedge E'|H)/p(E \wedge E')$. Schupbach and Sprenger consider this property, possessed
by neither their measure $\mathcal{E}_{SS}$ nor the Crupi-Tentori measure $\mathcal{E}_{CT}$,
problematic and illustrate their objection with an example. Let E be an
observed Brownian motion, let H be an appropriate physical explanation
of that motion, and let $E'$ be a proposition about the mating season of the
American tree frog. Clearly, H explains E much better than it explains $E \wedge E'$,
that is, the Brownian motion together with the tree frog mating season proposition. A
substantial part of $E \wedge E'$ stays unexplained.
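
The contrast between the three measures can be checked numerically. The sketch below assumes, for simplicity, that the irrelevant evidence E' is statistically independent of E, H and their conjunction, so that p(E ∧ E'|H) = p(E') p(E|H) and p(E ∧ E') = p(E') p(E). All function names and numbers are hypothetical and chosen only for illustration.

```python
import math

def gmg(p_e_h, p_e):
    # Good-McGrew measure: log of p(E|H) / p(E).
    return math.log(p_e_h / p_e)

def ct(p_e_h, p_e):
    # Crupi-Tentori measure, with separate normalization by sign.
    diff = p_e_h - p_e
    return diff / (1 - p_e) if diff >= 0 else diff / p_e

def ss(p_e_h, p_e, p_h):
    # Schupbach-Sprenger measure, built from p(H|E) and p(H|not-E).
    p_h_e = p_h * p_e_h / p_e
    p_h_not_e = p_h * (1 - p_e_h) / (1 - p_e)
    return (p_h_e - p_h_not_e) / (p_h_e + p_h_not_e)

# Hypothetical values: p(H) = 0.5, p(E|H) = 0.8, p(E) = 0.5, p(E') = 0.5.
p_h, p_e_h, p_e, q = 0.5, 0.8, 0.5, 0.5

before = (gmg(p_e_h, p_e), ct(p_e_h, p_e), ss(p_e_h, p_e, p_h))
# Under independence, conjoining E' multiplies both p(E|H) and p(E) by q.
after = (gmg(q * p_e_h, q * p_e), ct(q * p_e_h, q * p_e), ss(q * p_e_h, q * p_e, p_h))

for name, b, a in zip(("GMG", "CT", "SS"), before, after):
    print(f"{name}: E alone = {b:.4f}, E and E' = {a:.4f}")
```

With these inputs the Good-McGrew value is unchanged by the irrelevant conjunct, whereas the other two values drop, which is the behaviour described in the text.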


This criticism echoes the paradox of irrelevant conjunctions that we 
have encountered in Variation 2, applied to the ratio measure r(H,E). In 
that case, r allowed for tacking additional (irrelevant) hypotheses without 
lowering the degree of confirmation. This was taken as a reason to rule 
out r as an appropriate measure. The same argument pattern applies 
here: tacking irrelevant conjunctions to the explanandum should lower the 
degree of explanatory power and not leave it constant. 


In defense of $\mathcal{E}_{GMG}$, Cohen (2016b) notes that the bite of Schupbach and
Sprenger’s objection depends on whether E’ is meant to be explained by H 
or not. For example, if E’ is just some extra data obtained in an experiment 
(e.g., demographic data in a psychological survey), then it seems that H 
should not be penalized for failing to explain E’. Whether the addition 
of irrelevant evidence is problematic seems to depend on the focus of the 
explanation: is H supposed to explain all of the evidence or just the part 
that we consider crucial? 

We are not sure that this observation rescues $\mathcal{E}_{GMG}$. To accommodate
this kind of context-sensitivity, we would rather conceive of explanatory 
power as a ternary relation between explanans, explanandum and addi- 
tional data. What we argue here is that explanatory power should not be 
invariant under adding data that are part of the explanandum, but fail to be 
rationalized. Hence, the Good-McGrew measure $\mathcal{E}_{GMG}$ remains problematic.

What about the other two measures? Should $\mathcal{E}_{SS}$ or $\mathcal{E}_{CT}$ be preferred?
Of course, we are not completely unbiased in answering this question: one
of the authors of this monograph (J.S.) developed the $\mathcal{E}_{SS}$-measure together
with Jonah Schupbach. We will now advance two arguments in favor of
$\mathcal{E}_{SS}$, both of them due to Cohen (2016a).


The first argument concerns the scaling properties of both measures. 
It is based on a simple coin-flipping example. There are two identical- 
looking coins, one of which is fair while the other one is biased (say, with 
a 70/30 bias in favor of heads). We test one of the two coins, but do 
not know which one and we consider both cases equally probable. Now 
consider the hypothesis H that the tested coin is biased and the event $E_N$
that all N tosses of the coin turn out to be heads. Certainly, this hypothesis
explains $E_N$ to a certain degree, primarily because that course of events
would be a truly extraordinary coincidence under the hypothesis $\neg H$.


However, the Crupi-Tentori measure disagrees: as N increases,
$\mathcal{E}_{CT}(E_N,H)$ quickly approaches zero (e.g., $\mathcal{E}_{CT}(E_{10},H) = 0.014$). In other
words, $\mathcal{E}_{CT}$ treats a statistically highly relevant hypothesis as if it were
independent of the explanandum. $E_N$ is surprising under H, but it is much
more surprising under $\neg H$, a fact that is not reflected by $\mathcal{E}_{CT}$. By contrast,
Schupbach and Sprenger’s measure $\mathcal{E}_{SS}$ converges to a reasonable, but not
too high value ($\mathcal{E}_{SS}(E_N,H) \to 0.33$), indicating that H outperforms $\neg H$
while being a far from perfect explanation. In other words, $\mathcal{E}_{SS}$ captures
the contrastive nature of scientific explanations (van Fraassen, 1980) better
than $\mathcal{E}_{CT}$.
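
The coin example can be reproduced with a few lines of Python. This is a minimal sketch under the assumptions stated in the text (a 70/30 bias, equal prior probability for both coins); the function name `coin_example` and its parameters are hypothetical.

```python
def coin_example(n, bias=0.7, prior=0.5):
    """Return (E_CT, E_SS) for H = 'the coin is biased' and E_N = 'n heads in n tosses'."""
    p_e_h = bias ** n          # p(E_N | H): biased coin lands heads n times
    p_e_not_h = 0.5 ** n       # p(E_N | not-H): fair coin lands heads n times
    p_e = prior * p_e_h + (1 - prior) * p_e_not_h
    # Crupi-Tentori measure (7.2).
    diff = p_e_h - p_e
    ct = diff / (1 - p_e) if diff >= 0 else diff / p_e
    # Schupbach-Sprenger measure (7.3).
    p_h_e = prior * p_e_h / p_e
    p_h_not_e = prior * (1 - p_e_h) / (1 - p_e)
    ss = (p_h_e - p_h_not_e) / (p_h_e + p_h_not_e)
    return ct, ss

for n in (1, 5, 10, 50):
    ct_n, ss_n = coin_example(n)
    print(f"N={n:2d}  E_CT={ct_n:.4f}  E_SS={ss_n:.4f}")
```

For N = 10 the Crupi-Tentori value rounds to the 0.014 quoted above, and for large N the Schupbach-Sprenger value approaches 1/3, because p(H|E_N) tends to 1 while p(H|not-E_N) tends to the prior 0.5.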


The second, and more serious, criticism is based on how $\mathcal{E}_{CT}$ deals
with irrelevant evidence. If (negative) explanatory power remains constant
under the addition of irrelevant evidence, as the second (negative) clause
of Explanatory Justice demands, then Crupi and Tentori should also believe
that $\mathcal{E}_{CT}$ remains constant under the addition of irrelevant disjunctions
to the hypothesis. Hence, they should require that
$\mathcal{E}_{CT}(E,H) = \mathcal{E}_{CT}(E,H \vee H')$ whenever $H'$ is statistically independent of E, H and
$E \wedge H$, and $p(E|H) < p(E)$. However, Cohen (2016a, Claim 1) shows that
in this case, $\mathcal{E}_{CT}(E,H) < \mathcal{E}_{CT}(E,H \vee H')$. This reduces the entire idea of
Explanatory Justice, which motivated the Crupi-Tentori measure, to absurdity:
explanatory power is not increased for $E \wedge E'$, but it is increased
for $H \vee H'$. This internal inconsistency can be construed as an argument
in favor of the Schupbach-Sprenger measure $\mathcal{E}_{SS}$. Those who favor the
Crupi-Tentori measure may try to evade this objection by making suitable
modifications for the case of negative explanatory power. This is a topic
for future research, though.
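
A small numerical illustration of this asymmetry, offered as a sketch rather than as Cohen's own argument: under the stated independence assumptions, p(E|H ∨ H') is a weighted average of p(E|H) and p(E), so it lies strictly between the two. The function name `e_ct` and all numbers below are hypothetical.

```python
import math

def e_ct(p_e_given_h, p_e):
    # Crupi-Tentori measure (7.2).
    diff = p_e_given_h - p_e
    return diff / (1 - p_e) if diff >= 0 else diff / p_e

# Hypothetical case of negative relevance: p(E|H) = 0.1 < p(E) = 0.3.
p_h, p_e_h, p_e_not_h = 0.5, 0.1, 0.5
p_e = p_h * p_e_h + (1 - p_h) * p_e_not_h
base = e_ct(p_e_h, p_e)

# Irrelevant conjunct E' with p(E') = q, independent of E, H and E & H:
# both p(E|H) and p(E) are multiplied by q.
q = 0.5
with_conjunct = e_ct(q * p_e_h, q * p_e)

# Irrelevant disjunct H' with p(H') = r, independent of E, H and E & H:
# p(E & (H or H')) = p(E & H)(1 - r) + p(E) r,  p(H or H') = p(H)(1 - r) + r.
r = 0.5
p_e_and_disj = p_e_h * p_h * (1 - r) + p_e * r
p_disj = p_h * (1 - r) + r
with_disjunct = e_ct(p_e_and_disj / p_disj, p_e)

print(base, with_conjunct, with_disjunct)
```

Here the negative value is unchanged by the irrelevant conjunct, as Explanatory Justice demands, but it moves towards zero when the irrelevant disjunct is added to the hypothesis.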


Discussion 


This variation motivated, presented and compared various Bayesian accounts
of explanatory power. For starters, the two grand traditions for conceiving
of scientific explanation (the view of explanations as arguments
and the causal-interventionist view) have been introduced and discussed.
In the light of that exposition, it seems that none of the views captures 
completely what scientific explanations are about; yet, each of the views 
retains important features of scientific explanation that can be used for an 
explication of explanatory power. 


The core of this variation has been the derivation and comparison of 
three different Bayesian measures of explanatory power. In our view, the 
results favor the Schupbach-Sprenger measure $\mathcal{E}_{SS}$. The competitors, the
Good-McGrew measure $\mathcal{E}_{GMG}$ and the Crupi-Tentori measure $\mathcal{E}_{CT}$, are
haunted by general objections pertaining to their functional form that are, 
at least as things stand now, not easy to answer. Context-specific consid- 
erations may have the last word in each application, however. 


Another dimension of investigating measures of explanatory power 
consists in empirical work. In an experiment that transfers the design of 
Crupi et al. (2007) to the case of explanatory power, Schupbach (2011a)
found that $\mathcal{E}_{SS}$ best describes participants’ judgments of explanatory
power. For methodological criticism of this design and the statistical
analysis, see Glymour (2015). Recent experiments on explanatory power 
and related cognitive values (e.g., Colombo et al., 2016a,b) confirm that 
explanatory judgment is sensitive to statistical relevance, lending empir- 
ical support to the Bayesian research program on explanatory reasoning 
and explanatory power. The above studies also revealed a strong link be- 
tween judgments of explanation, confirmation and rational acceptability. 
Furthermore, Lombrozo (2007) has investigated how the simplicity of an 
explanation affects its perceived value. Future studies could transfer this 
design from Lombrozo’s artificial and idealized scenario, involving inhab- 
itants of an alien planet, to an ecologically more valid setting. 

More specifically, measures of explanatory power can help to construct 
a Bayesian account of Inference to the Best Explanation, and to develop a 
mathematically precise version of IBE. From a descriptive point of view, 
recent empirical work has shown that people accept hypotheses on
the basis of their explanatory value rather than on the basis of objective chances
(Douven and Schupbach, 2015a,b). This motivates further empirical research
into the circumstances under which people’s reasoning conforms
to IBE. From a normative point of view, Schupbach (2011b, 2016) has used 
computer simulations in order to show that IBE—conceptualized as infer- 
ence to the hypothesis with the highest explanatory power—is a reliable 
mode of inference. Peirce’s inference scheme (E, H explains E $\Rightarrow$ H) is
replaced by the scheme (E, $\mathcal{E}(E,H) > \mathcal{E}(E,H_i)$ for all alternatives $H_i$ $\Rightarrow$
H). This sophisticated form of IBE approximates Bayesian reasoning very
well, and in Schupbach’s simulations, the explanatorily most valuable hy- 
pothesis matches the true hypothesis in an overwhelming number of cases. 
More research along these lines may help to determine the conditions un- 
der which IBE is a sound form of scientific reasoning, and to shed light 
on issues where IBE takes a prominent role, such as the ongoing debate 
between realists and anti-realists. 
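
The quantitative inference scheme can be sketched in a few lines. The following Python fragment is not a reconstruction of Schupbach's simulations; it only illustrates the rule "infer the hypothesis with the highest explanatory power", using the Schupbach-Sprenger measure and assuming that the candidate hypotheses are mutually exclusive and jointly exhaustive. The function names `e_ss` and `ibe` and the numbers are hypothetical.

```python
def e_ss(prior, likelihood, p_e):
    # Schupbach-Sprenger measure (7.3) for one hypothesis.
    post_e = prior * likelihood / p_e                  # p(H|E)
    post_not_e = prior * (1 - likelihood) / (1 - p_e)  # p(H|not-E)
    return (post_e - post_not_e) / (post_e + post_not_e)

def ibe(priors, likelihoods):
    """Return the index of the hypothesis with maximal explanatory power, and all scores.

    Assumes the hypotheses form a partition, so p(E) follows from total probability.
    """
    p_e = sum(pr * lk for pr, lk in zip(priors, likelihoods))
    scores = [e_ss(pr, lk, p_e) for pr, lk in zip(priors, likelihoods)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical priors p(H_i) and likelihoods p(E|H_i) for three rival hypotheses.
best, scores = ibe([0.6, 0.3, 0.1], [0.2, 0.5, 0.9])
print(best, scores)
```

Because the prior cancels in (7.3), the ranking of rival hypotheses by this measure follows their likelihoods p(E|H_i), in line with the Statistical Relevance condition.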

Finally, there is ample room for combining empirical and theoretical re- 
search on explanatory inference in the Bayesian paradigm. A particularly 
salient issue concerns the role and interplay of causal and probabilistic 
factors in explanatory reasoning. One could, for example, envision a sys- 
tematic comparison of measures of causal effect and explanatory power. 
Is there an isomorphism between some measures of explanatory power 
and causal effect, based on their joint interest in the predictive power of 
the explanans (=the cause) for the explanandum (=the effect)? How do 
explanatory, causal and probabilistic reasoning interact (Lombrozo, 2009, 
2011, 2012; Sloman and Lagnado, 2015)? Do these differences have corre- 
lates on the level of a formal Bayesian analysis? 

We hope that our contribution will stimulate further research on the 
nature of explanation. In particular, we hope that our results will help 
to promote “the prospects for a naturalized philosophy of explanation” 
(Lombrozo, 2011, 549), where philosophical theorizing about the nature of 
explanation is constrained and informed by empirical evidence about the 
psychology of explanatory power and where, on the other hand, philosophical
research stimulates empirical investigations into explanatory reasoning.


Variation 8: Intertheoretic Reduction 


Establishing relations between different theories is an important goal of 
science. Unified theories with a wide scope and a small number of ba- 
sic postulates have been found attractive by scientists at all times. Take, 
for example, Newtonian mechanics, which can be used to explain terrestrial
as well as celestial motion, unifying Galilei’s invariance principle with
Kepler’s Laws of planetary motion. Or consider Maxwell’s theory of elec- 
trodynamics which provides a unified account of electric and magnetic 
forces, and the laws governing their interaction. There is also the famous 
example of statistical mechanics, whose micro-level laws about the motion 
of molecules provide the foundations for a macro-level theory about the 
behavior of gases and fluids, namely thermodynamics. 

The relation between statistical mechanics and thermodynamics is spe- 
cial because it is a paradigm example of intertheoretic reduction: account- 
ing for the behaviour of a system at a certain level of organization by 
describing the behavior of its constituents. What exactly is involved in 
a reduction is a matter of philosophical controversy (see van Riel and 
Van Gulick, 2014). The basic idea is that the concepts and laws of a 
phenomenological theory $T_P$, such as thermodynamics, are “reduced”
to laws of a more fundamental theory $T_F$, such as statistical mechanics.
Often this reduction is executed by deriving the laws of $T_P$ from
those of $T_F$ (Nagel, 1961; Schaffner, 1967); more on this below. Following
standard terminology, we say that $T_P$ is the reduced theory and that $T_F$
is the reducing theory. Other examples of (putative) intertheoretic reductions
are chemistry to atomic physics, rigid body mechanics to particle
mechanics, psychology to neuroscience, and agent-based modeling in the 
social sciences. 

Reductions are, if successful, celebrated by scientists because they al- 
low for a unified theoretical framework in which one can investigate the 
phenomenological as well as the fundamental theory. They also allow for 
precise predictions on the phenomenological level motivated by assumptions
on the fundamental level. They may provide deep insight into,
and an explanation of, the nature of central concepts of the involved
disciplines. For instance, the thermodynamic concept of heat is identified 
with the energy transfer by a disordered, microscopic action on a system of 
molecules, described by statistical mechanics. For these reasons, interthe- 
oretic reductions are taken to make large contributions to the cognitive 
advancement of science. 

In this variation, we show how the establishment of intertheoretic re- 
ductions boosts the cognitive value of the involved theories by confirming 
them in the Bayesian sense. More specifically, we show that if there is a re- 
ductive relation between two theories, then confirmation flows both from 
the phenomenological to the fundamental theory and from the fundamen- 
tal to the phenomenological theory. For instance, evidence that exclusively 
confirms statistical mechanics before the reduction also confirms (though 
perhaps to a lower degree) thermodynamics after the reduction, and vice 
versa. 

Section 8.1 sets the scene by outlining the Generalized Nagel-Schaffner 
(GNS) model of reduction, which serves as the foil for our Bayesian analy- 
sis. Section 8.2 contains the main argument: we consider the confirmation 
of $T_F$ and $T_P$ in two scenarios: one with and one without intertheoretic
reduction. We conclude that reduction boosts confirmation, and Section 
8.3 discusses various implications of this result. In Section 8.4, we sum 
up our results and outline a number of open problems. Further detail 
is contained in the articles “Who is Afraid of Nagelian Reduction?” and 
“Confirmation and Reduction” by Dizadji-Bahmani et al. (2010, 2011), on 
which this variation is based. 


The Generalized Nagel-Schaffner Model 


For describing the idea behind the GNS model of reduction which mo- 
tivates our later Bayesian analysis, it may be useful to begin with the 
familiar case of thermodynamics and statistical mechanics. Thermody- 
namics describes systems like gases and solids in terms of macroscopic 
properties such as volume, pressure, temperature and entropy, and gives 
a correct description of the behaviour of such systems. The aim of statistical
mechanics is to account for the laws of thermodynamics in terms
of dynamical laws governing the microscopic constituents of macroscopic
systems (Frigg, 2008). In particular, statistical mechanics aims to show that 
the Second Law of Thermodynamics is a consequence of the mechanical 
motion of the molecules of the gas. For example, consider a container di- 
vided in two by a partition wall. The left half is filled with a gas, while the 
right half is empty. If we now remove the partition, the gas will spread 
and soon be evenly distributed throughout the entire container; the gas’s 
entropy increases as it spreads. This is an instance of a process obeying 
the Second Law of Thermodynamics. Roughly speaking, the Second Law 
says that the entropy of a closed system cannot decrease, and usually in- 
creases when the system is left on its own in a non-equilibrium state. The 
aim of statistical mechanics is to account for the Second Law in general 
in terms of the equations governing the motion of the molecules of the 
gas and some probabilistic assumptions; that is, it aims to show that the 
Second Law is a consequence of its basic postulates. Or almost so. 
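
To make the container example concrete, here is a standard textbook calculation that is not taken from the text above. It rests on added assumptions: the gas is ideal, consists of N molecules, and removing the partition doubles the available volume at fixed energy. Boltzmann's formula and the symbols W, V_i and V_f are introduced here for illustration only.

```latex
% Assumption: ideal gas of N molecules; at fixed energy the number of
% accessible microstates W scales with V^N.
\[
  S = k_B \ln W, \qquad W \propto V^N
\]
% Free expansion from V_i to V_f = 2 V_i:
\[
  \Delta S = k_B \ln\frac{W_f}{W_i} = N k_B \ln\frac{V_f}{V_i} = N k_B \ln 2 > 0 .
\]
```

For independently moving molecules, the probability of finding all N of them in the left half again is 2^{-N}: tiny for macroscopic N, but not zero. This is one way to see why the statistical-mechanical analogue of the Second Law is probabilistic rather than exceptionless.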

That analogues of the laws of the phenomenological theory (here: ther- 
modynamics) should follow from the laws of the fundamental theory 
(here: statistical mechanics) is the basic idea of GNS. Consider a phenomenological
theory $T_P$ and a fundamental theory $T_F$, each of which is identified
with a set of empirical propositions. So let $T_P := \{T_P^{(1)}, \ldots, T_P^{(n_P)}\}$ and
$T_F := \{T_F^{(1)}, \ldots, T_F^{(n_F)}\}$. The reduction of $T_P$ to $T_F$ consists of the following
three steps (Schaffner, 1967):


1. Adopt auxiliary assumptions describing the particular setup under 
investigation. Here, these are assumptions about the mechanical 
properties of the gas molecules. Then derive from these and $T_F$ a
restricted version of each proposition $T_F^{(i)}$. Denote these by $T_F^{*(i)}$ and
the corresponding set by $T_F^* := \{T_F^{*(1)}, \ldots, T_F^{*(n_F)}\}$.


2. $T_P$ and $T_F^*$ are formulated in different vocabularies. In our example,
statistical mechanics talks about trajectories in phase space
and probability measures while thermodynamics talks about macro- 
scopic properties such as pressure and temperature. In order to con- 
nect the two theories, we adopt bridge laws. These connect terms of 
one theory with terms of the other, for instance mean kinetic energy 
in statistical mechanics with temperature in thermodynamics. Substituting
the terms in $T_F^*$ with terms from $T_P$ as per the bridge laws
yields $T_P^*$, i.e., the set $\{T_P^{*(1)}, \ldots, T_P^{*(n_P)}\}$.


3. Show that each element of $T_P^*$ is strongly analogous to the corresponding
element in $T_P$.


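[Figure 8.1 diagram: $T_F$, together with auxiliary assumptions and boundary conditions, yields $T_F^*$ by derivation; bridge laws lead from $T_F^*$ to $T_P^*$; $T_P^*$ stands in strong analogy to $T_P$.]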


Figure 8.1: The Generalized Nagel-Schaffner (GNS) model of reduction 


If these conditions obtain, we say that $T_P$ is reduced to $T_F$. See Figure 8.1
for a graphical illustration. 

We now explain two central notions that occur in the GNS model of 
intertheoretic reduction. 

First, the notion of strong analogy, which may appear inappropriately 
vague. After all, Nagel himself has stressed the importance of logical and 
mathematical relations that hold between the reducing and the reduced 
theory. These strong links between Tp and Tp seem to be watered down 
by introducing a concept which introduces a great deal of subjective judg- 
ment on behalf of the scientist. However, it is often impossible to derive 
the exact laws of $\mathcal{T}_P$. For instance, it is not possible to derive the exact
Second Law of Thermodynamics from statistical mechanics, which is a 
probabilistic theory, whereas the Second Law is supposed to hold without 
exception. Thus exact derivability is too stringent a requirement. It suffices 
to deduce laws that are approximately the same as the laws of $\mathcal{T}_P$. For the
case of statistical mechanics and thermodynamics, we derive a probabilis- 
tic law that is strongly analogous to the Second Law of Thermodynamics: 
namely the proposition that entropy is highly likely to increase over time, 
which is known as Boltzmann’s Law. This revision of the original model 
has been developed in a string of publications by Schaffner (1967, 1969, 
1976, 1977, 1993), and, indeed, by Nagel (1979) himself. In sum, reduction
is the deductive subsumption of a corrected version of $\mathcal{T}_P$
under $\mathcal{T}_F$, where the deduction involves first deriving a
restricted version, $\mathcal{T}_F^*$, of the reducing theory by
introducing boundary conditions and auxiliary assumptions and then using
bridge laws to obtain $\mathcal{T}_P^*$ from $\mathcal{T}_F^*$.

This brings us to the second point: the notion of a bridge law. While 
Nagel himself remains relatively non-committal about the exact form and 
nature of bridge laws, Schaffner (1976, 614-615) offers a concise
characterization of bridge laws, which he calls reduction functions. For
Schaffner, a reduction function is a statement to the effect that a term
$y_P$ of $\mathcal{T}_P$ and a term $y_F$ of $\mathcal{T}_F$ are
coextensional. For example, the terms “temperature” and “mean kinetic
energy” are coextensional when applied to a gas (we come back to this
qualification below). At least in physics, properties usually have
magnitudes: a gas does not have a temperature simpliciter, it has a
temperature of so and so many degrees Kelvin. Thus, a bridge law does not
only establish coextensionality; it also specifies the functional
relationship between the magnitudes of the terms and the units of
measurement. That is, the bridge law contains a function $f$ such that
$\tau_P = f(\tau_F)$, where, respectively, $\tau_P$ and $\tau_F$ are the
values of $y_P$ and $y_F$. So we can give the following tentative
definition of bridge laws (we will qualify this statement below): A bridge
law is a statement to the effect that (i) $y_P$ applies if, and only if,
$y_F$ applies, and (ii) $\tau_P = f(\tau_F)$.
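
To make the functional part of a bridge law concrete, here is a worked instance for the temperature example. It is only a sketch: it assumes a monatomic ideal gas, for which the mean translational kinetic energy per molecule and the absolute temperature are related through the Boltzmann constant $k_B$. The choice of this gas model, and the numerical value used in the example, are our illustrative assumptions and are not fixed by the GNS model itself.

```latex
% Assumption: monatomic ideal gas. \tau_F is the mean kinetic energy per
% molecule (in joules), \tau_P the temperature (in kelvin).
\[
  \langle E_{\mathrm{kin}} \rangle = \tfrac{3}{2}\, k_B\, T
  \quad\Longrightarrow\quad
  \tau_P = f(\tau_F) = \frac{2}{3 k_B}\, \tau_F ,
  \qquad k_B = 1.380649 \times 10^{-23}\ \mathrm{J/K}.
\]
% Illustrative value: \tau_F = 6.21 \times 10^{-21} J gives \tau_P \approx 300 K.
```

In this instance, clause (i) of the tentative definition says that the two terms apply to the same systems, while clause (ii) is carried by the linear function $f$.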

Both the concept and the epistemology of strong analogy and bridge 
law have served as the basis for criticism of the GNS model of reduc- 
tion. For instance, the so-called New Wave Reductionists (e.g., Church- 
land, 1979, 1985; Bickle, 1998) deny that bridge laws play an important 
role in the discovery of reductive relations. However, we do not want to 
engage (again) in a debate about the merits and drawbacks of the GNS 
model, but to show how it can be used for demonstrating the confirma- 
tory value of reductive relations. Therefore we refer the interested reader 
to Dizadji-Bahmani et al. (2010), where these and similar criticisms are 
addressed and, to our mind, convincingly rebutted. 


Reduction and Confirmation 


Consider how theories are supported by evidence. With regard to our two
theories $\mathcal{T}_P$ and $\mathcal{T}_F$, there are three kinds of evidence: evidence that only
confirms the phenomenological theory, evidence that only confirms the 
fundamental theory and evidence that confirms, to some degree, both. We 
make this clear with examples from thermodynamics and statistical me- 
chanics. For the first case, consider what is known as the Joule-Thomson 
process: there are two chambers of different dimensions connected to each 
other by a permeable membrane, filled with a gas. At the end of each 
chamber, there is a piston which allows the pressure and volume for the 
gas in each chamber to be varied by applying a force. The pressure in the 
first chamber is higher than the pressure in the second. Now push the
gas from the first chamber into the second, but so slowly that the pressure
remains constant in both chambers and no heat is exchanged with the
environment. Then, the gas in the second chamber cools down. The amount
of cooling can be calculated using the principles of thermodynamics, and
is found to coincide with experimental values. So we have a confirmation
of thermodynamics, but not of statistical mechanics since no statistical
mechanics assumptions have been used in the argument. For the second,
consider the dependence of a metal’s electrical conductivity on
temperature. From statistical mechanics, one can derive an equation
relating the change in the electrical conductivity of certain metals to a
change in temperature, which is what one finds in experiments.
Thermodynamics, in contrast, is entirely silent about this phenomenon.
Third, consider again
the gas confined to the left half of the box which spreads evenly when 
the dividing wall is removed. It follows from thermodynamics that the 
thermodynamic entropy of the gas increases; at the same time, it is a con- 
sequence of statistical mechanics that the Boltzmann entropy increases in 
that process. So the spreading of the gas confirms both statistical me- 
chanics and thermodynamics. We shall now explicate this intuition in a 
Bayesian model. 


Before the Reduction 


We examine the situation before a reduction is attempted. To simplify
things, we assume that $\mathcal{T}_P$ and $\mathcal{T}_F$ have only one
element each, viz. $T_P$ and $T_F$ respectively. The generalization to
more than one element is conceptually straightforward. Furthermore, $E$
confirms $T_F$ and $T_P$, $E_F$ only confirms $T_F$ and $E_P$ only
confirms $T_P$. Introducing corresponding propositional variables $T_F$,
$T_P$, $E$, $E_F$ and $E_P$, we can represent the situation before the
attempted reduction in the Bayesian network depicted in Figure 8.2.

Following our methodology, we have to specify the prior probabilities
of $T_F$ and $T_P$ (i.e., of all root nodes) and the conditional
probabilities of $E$, $E_F$ and $E_P$ (i.e., of all child nodes), given
their parents. We denote:


\[
\begin{aligned}
p(T_F) &= t_F, & p(T_P) &= t_P \\
p(E_F|T_F) &= p_F, & p(E_F|\neg T_F) &= q_F \\
p(E_P|T_P) &= p_P, & p(E_P|\neg T_P) &= q_P \\
p(E|T_F, T_P) &= \alpha, & p(E|T_F, \neg T_P) &= \beta \\
p(E|\neg T_F, T_P) &= \gamma, & p(E|\neg T_F, \neg T_P) &= \delta
\end{aligned}
\tag{8.1}
\]


Figure 8.2: The Bayesian network representing the situation before the
reduction.


These parameters cannot be freely chosen as we assume that the following
conditions hold: First, $E_F$ confirms $T_F$, hence $p_F > q_F$. Second,
$E_P$ confirms $T_P$, hence $p_P > q_P$. Third, $E$ confirms $T_P$ and,
fourth, $E$ confirms $T_F$. The last two conditions entail the following
constraints on $\alpha$, $\beta$, $\gamma$ and $\delta$, respectively (all
proofs are in the final section; a bar denotes the complement, e.g.,
$\bar{t}_F := 1 - t_F$):


\[ (\alpha - \beta)\, t_F + (\gamma - \delta)\, \bar{t}_F > 0 \tag{8.2} \]
\[ (\alpha - \gamma)\, t_P + (\beta - \delta)\, \bar{t}_P > 0 \tag{8.3} \]


These inequalities hold, for example, if $\alpha > \beta, \gamma > \delta$,
which seems to be a natural condition. One may also want to require that
$p(T_F|E, E_F) > p(T_F)$ and $p(T_P|E, E_P) > p(T_P)$. Note, however, that
both inequalities follow from the above four conditions (proof omitted).

Given this network structure and the conditional independences encoded
in it, it is easy to see, for example, that the variable $E_F$ is
independent of $T_P$ given $T_F$ and that $E_P$ is independent of $T_F$
given $T_P$. In symbols:


\[ E_F \perp\!\!\!\perp T_P \mid T_F , \qquad E_P \perp\!\!\!\perp T_F \mid T_P \tag{8.4} \]


Hence, $E_F$ does not confirm (or disconfirm) $T_P$ and $E_P$ does not
confirm (or disconfirm) $T_F$:


\[ p(T_P|E_F) = p(T_P) , \qquad p(T_F|E_P) = p(T_F) \tag{8.5} \]


We conclude that there is no flow of confirmation from one theory to the
other. The intuitive reason for this is that there is no chain of arrows
from $E_F$ to $T_P$. Note also that the variables $T_F$ and $T_P$ are
probabilistically independent before the reduction:


\[ p(T_F, T_P) = p(T_F)\, p(T_P) = t_F\, t_P \tag{8.6} \]


All this may, however, not be right in practice. Scientists may feel,
for example, that the two theories are much more intimately connected.
An indication for this may be that there is, as we assume, evidence E
that supports both theories. Another reason may be that there are formal
(or other) relations between the two theories. In this case, scientists will
attempt to reduce one theory to the other. Let us now model this situation.
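
The claims made about the network before the reduction can be checked numerically. The following sketch enumerates the joint distribution by brute force and confirms the irrelevance claims (8.5), the independence (8.6), and that $E$ confirms both theories. All numerical parameter values and all function names are our own illustrative assumptions; they are not taken from the text.

```python
from itertools import product

# Illustrative parameter values (assumptions, not taken from the text).
t_F, t_P = 0.4, 0.3                              # priors of T_F and T_P
p_F, q_F = 0.8, 0.3                              # p(E_F | T_F), p(E_F | not T_F)
p_P, q_P = 0.7, 0.2                              # p(E_P | T_P), p(E_P | not T_P)
alpha, beta, gamma, delta = 0.9, 0.5, 0.4, 0.1   # p(E | T_F, T_P), etc.

NAMES = ("tf", "tp", "e", "ef", "ep")


def bern(prob_true, value):
    """Probability that a binary variable with p(True) = prob_true equals value."""
    return prob_true if value else 1.0 - prob_true


def joint(tf, tp, e, ef, ep):
    """Joint probability, factorized as in the network before the reduction."""
    p_e = {(True, True): alpha, (True, False): beta,
           (False, True): gamma, (False, False): delta}[(tf, tp)]
    return (bern(t_F, tf) * bern(t_P, tp) * bern(p_e, e)
            * bern(p_F if tf else q_F, ef) * bern(p_P if tp else q_P, ep))


def prob(**fixed):
    """Marginal probability of the partial assignment `fixed` (brute force)."""
    total = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(**world)
    return total


# (8.5): E_F is irrelevant to T_P, and E_P is irrelevant to T_F.
post_tp_given_ef = prob(tp=True, ef=True) / prob(ef=True)
post_tf_given_ep = prob(tf=True, ep=True) / prob(ep=True)
# (8.6): the two theories are probabilistically independent.
joint_tf_tp = prob(tf=True, tp=True)
# E confirms both theories, as constraints (8.2) and (8.3) require.
post_tp_given_e = prob(tp=True, e=True) / prob(e=True)
post_tf_given_e = prob(tf=True, e=True) / prob(e=True)

print(post_tp_given_ef, post_tf_given_ep, joint_tf_tp)
print(post_tp_given_e, post_tf_given_e)
```

Brute-force enumeration is used here only because the network is tiny; it keeps the sketch independent of any particular Bayesian-network library.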


After the Reduction 


Recall the three steps involved in reducing one theory to another set out
in Section 8.1: First, derive $T_F^*$ from the auxiliary assumptions and
$T_F$. Second, introduce bridge laws and obtain $T_P^*$ from $T_F^*$.
Third, show that $T_P^*$ is strongly analogous to $T_P$.


Figure 8.3: The Bayesian Network representing the situation after the re- 
duction. 


The situation after the reduction can then be represented in the 
Bayesian network depicted in Figure 8.3. To complete the network, we 
specify the following conditional probabilities: 


\[ p'(T_P|T_P^*) = p_P^* , \qquad p'(T_P|\neg T_P^*) = q_P^* \tag{8.7} \]
\[ p'(T_F^*|T_F) = p_F^* , \qquad p'(T_F^*|\neg T_F) = q_F^* \tag{8.8} \]


Note that Equation (8.7) replaces the second equation in the first line of
Equation (8.1). We also have to represent the bridge law in probabilistic
terms. Naturally, we require:

\[ p'(T_P^*|T_F^*) = 1 , \qquad p'(T_P^*|\neg T_F^*) = 0 \tag{8.9} \]


All other probability assignments hold as in the case of $p$, the
distribution before the reduction. Requiring this condition makes sure
that we can compare the two scenarios later, i.e., the situations before
and after the reduction.

Three remarks about the three steps in the reduction are in order. First,
$T_F^*$ may be more or less good. How good it is depends on the context
(i.e., the application in question and the auxiliary assumptions made) and
on the judgment of the scientists involved. In line with our Bayesian
approach, we assume that the judgment of the scientists can be expressed
in probabilistic terms. Second, the move from $T_F^*$ to $T_P^*$ in virtue
of the bridge laws may be controversial amongst scientists. Whilst bridge-laws
are non-conventional factual claims, different scientists may assign dif- 
ferent credences to them. Third, what counts as strongly analogous will 
also depend on the specific context and on the judgment of the scientists. 
For example, whether entropy fluctuations can be neglected or not cannot 
be decided independently of the specific problem at hand, see Callender 
(2001). All this fits our Bayesian account well. 

Note that, in the Bayesian network in Figure 8.3, there is now a direct
sequence of arrows from $T_F$ to $T_P$: the path through $T_F^*$ to
$T_P^*$. And hence, we expect that $E_F$ is now probabilistically relevant
for $T_P$ and that $E_P$ is now probabilistically relevant for $T_F$. And
this is indeed what we find: the irrelevance claims formulated in Equation
(8.5) do not hold any more. We state our results in the following two
theorems:


Theorem 8.1 $E_F$ confirms $T_P$ iff $(p_F - q_F)\,(p_F^* - q_F^*)\,(p_P^* - q_P^*) > 0$.


This theorem entails that $E_F$ confirms $T_P$ if the following three
conditions hold: (i) $E_F$ confirms $T_F$ (i.e., $p_F > q_F$), (ii) $T_F$
confirms $T_F^*$ (i.e., $p_F^* > q_F^*$), and (iii) $T_P^*$ confirms $T_P$
(i.e., $p_P^* > q_P^*$). These conditions are immediately plausible.
Condition (i) was assumed from the beginning, and conditions (ii) and
(iii) make sure that there is a positive flow of confirmation from $T_F$
to $T_F^* \equiv T_P^*$ (qua bridge law) and from $T_P^*$ to $T_P$.


Theorem 8.2 $E_P$ confirms $T_F$ iff $(p_P - q_P)\,(p_F^* - q_F^*)\,(p_P^* - q_P^*) > 0$.


This theorem is analogous to the previous theorem. It entails that $E_P$
confirms $T_F$ if the following three conditions hold: (i) $E_P$ confirms
$T_P$ (i.e., $p_P > q_P$), (ii) $T_F$ confirms $T_F^*$ (i.e.,
$p_F^* > q_F^*$), and (iii) $T_P^*$ confirms $T_P$ (i.e., $p_P^* > q_P^*$).
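
Both theorems can be illustrated numerically. The sketch below builds the network after the reduction (the chain $T_F \to T_F^* \to T_P^* \to T_P$ with the deterministic bridge law of Equation (8.9)), computes the changes in probability by brute-force enumeration, and compares their signs with the products named in Theorems 8.1 and 8.2. The parameter values and identifiers are our own assumptions, chosen only for illustration.

```python
from itertools import product

# Illustrative values (assumptions chosen for this sketch, not from the text).
t_F = 0.4
p_F, q_F = 0.8, 0.3        # p'(E_F | T_F), p'(E_F | not T_F)
p_P, q_P = 0.7, 0.2        # p'(E_P | T_P), p'(E_P | not T_P)
pFs, qFs = 0.9, 0.2        # p'(T_F* | T_F), p'(T_F* | not T_F)
pPs, qPs = 0.85, 0.1       # p'(T_P | T_P*), p'(T_P | not T_P*)
E_TABLE = {(True, True): 0.9, (True, False): 0.5,      # alpha, beta
           (False, True): 0.4, (False, False): 0.1}    # gamma, delta

NAMES = ("tf", "tfs", "tps", "tp", "e", "ef", "ep")


def bern(prob_true, value):
    """Probability that a binary variable with p(True) = prob_true equals value."""
    return prob_true if value else 1.0 - prob_true


def joint(tf, tfs, tps, tp, e, ef, ep):
    """Joint probability, factorized as in the network after the reduction."""
    return (bern(t_F, tf)
            * bern(pFs if tf else qFs, tfs)
            * bern(1.0 if tfs else 0.0, tps)       # bridge law, Equation (8.9)
            * bern(pPs if tps else qPs, tp)
            * bern(E_TABLE[(tf, tp)], e)
            * bern(p_F if tf else q_F, ef)
            * bern(p_P if tp else q_P, ep))


def prob(**fixed):
    """Marginal probability of the partial assignment `fixed` (brute force)."""
    total = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(**world)
    return total


# Changes in probability obtained by enumeration.
gain_tp_from_ef = prob(tp=True, ef=True) / prob(ef=True) - prob(tp=True)
gain_tf_from_ep = prob(tf=True, ep=True) / prob(ep=True) - prob(tf=True)

# The products whose sign Theorems 8.1 and 8.2 refer to.
sign_8_1 = (p_F - q_F) * (pFs - qFs) * (pPs - qPs)
sign_8_2 = (p_P - q_P) * (pFs - qFs) * (pPs - qPs)

print(gain_tp_from_ef, sign_8_1)
print(gain_tf_from_ep, sign_8_2)
```

Swapping `pFs` and `qFs` (so that $T_F$ disconfirms $T_F^*$) flips the sign of both products, and with it the direction of the two probability changes.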

Note that, in our representation, the bridge law states a perfect
correlation between $T_F^*$ and $T_P^*$. A bridge law is posited by
scientists working in a particular field, and it may happen that not
everybody in that community is convinced of it. Thus, different scientists
may assign different credences to a particular bridge law. In a case where
a lower probability is assigned to a bridge law, the reduction may still
be epistemically valuable: the flow of confirmation will just be weaker.
How much confirmation will flow depends, of course, on the values of the
relevant probabilities.


For future reference, let us calculate the prior probability of the
conjunction of both theories. We obtain:


\[ p'(T_F, T_P) = t_F\, (p_F^*\, p_P^* + \bar{p}_F^*\, q_P^*) \tag{8.10} \]


In a similar way, we may calculate the posterior probability of both
theories given the total evidence, i.e., the expression
$p'(T_F, T_P|E, E_F, E_P)$ (see Section 8.5).


Finally, let us remark on the specific representation we have chosen in
the Bayesian Network in Figure 8.3. Clearly, having a sequence of arrows
from $T_F$ to $T_P$ ensures that confirmation can flow from one theory to
the other. However, this sequence of arrows is not just driven by our wish
to establish a flow of confirmation from the reducing to the reduced
theory: it makes scientific sense. First, $T_F^*$ is an approximation of
$T_F$. It follows from it and depends on it, hence the direction of the
arrow. Second, we have drawn an arrow from $T_F^*$ to $T_P^*$ although the
propositional variables in question are, qua the bridge law,
intersubstitutable with each other. This is modeled by assigning
appropriate conditional probabilities. The arrow could have also been
drawn from $T_P^*$ to $T_F^*$. In this case we would have had to require
$p'(T_F^*|T_P^*) = 1$ and $p'(T_F^*|\neg T_P^*) = 0$. These conditions are,
however, equivalent to Equations (8.9) for non-extreme priors. Third, it
may look strange that we have drawn a directed arrow from $T_P^*$ to $T_P$
to model strong analogy, which one would expect to be a symmetrical
relation. We would like to reply to this objection that, then, “analogy”
is perhaps not the right word as $T_P^*$ is indeed stronger than $T_P$,
and so it makes sense to draw an arrow from $T_P^*$ to $T_P$. We conclude
that the chain of arrows from $T_F$ to $T_P$ is indeed plausible.


Why Accept a Purported Reduction?


Under what conditions should we accept a proposed reduction? More
specifically, given everything we know about the domains of the two
theories, when should we accept a proposed reduction and when should we
reject it? In the Bayesian framework theories are accepted on the basis of
their probabilities and confirmation track record. But which probabilities
are relevant? The previous section focused on the probabilities of $T_F$
and $T_P$ individually. But perhaps one is interested in the “package” as
a whole, that is, the conjunction of $T_F$ and $T_P$. If so, should we look
at the prior probability of the conjunction of $T_F$ and $T_P$ after the
reduction (that is, without accounting for the total evidence)? Or at the
posterior probability of the conjunction of $T_F$ and $T_P$, i.e., the
probability of $T_F$ and $T_P$ given the total evidence (i.e., $E$, $E_F$
and $E_P$)? We examine these proposals in turn.

Let us first compare the prior probabilities of the conjunction of $T_F$
and $T_P$ before and after the reduction. Before the reduction, the two
theories are independent, as expressed in Equation (8.6). For convenience,
let us restate the condition:


\[ p(T_F, T_P) = t_F\, t_P \tag{8.11} \]


We now calculate the prior probability of the conjunction of $T_F$ and
$T_P$ after the reduction and obtain


\[ p'(T_F, T_P) = t_F\, (p_F^*\, p_P^* + \bar{p}_F^*\, q_P^*) . \tag{8.12} \]


While the expression in Equation (8.11) is an explicit function of $t_P$,
the expression in Equation (8.12) is not. This is because, after the
reduction, $T_P$ is no longer a root node, and so it is not assigned a
prior probability. In order to meaningfully compare the situation before
and after the reduction, we not only have to assume that
$p'(E_P|T_P) = p(E_P|T_P)$ etc., but also that $p'(T_P) = p(T_P)$. Let us
therefore calculate:


\[ \tilde{t}_P := p'(T_P) = t_F^*\, p_P^* + \bar{t}_F^*\, q_P^* \tag{8.13} \]


with

\[ t_F^* = p'(T_F^*) = p'(T_P^*) = p_F^*\, t_F + q_F^*\, \bar{t}_F . \tag{8.14} \]


Alternatively, we have:


\[ \tilde{t}_P = (p_F^*\, p_P^* + \bar{p}_F^*\, q_P^*)\, t_F + (q_F^*\, p_P^* + \bar{q}_F^*\, q_P^*)\, \bar{t}_F \tag{8.15} \]

This equation follows if we insert Equation (8.14) into Equation (8.13) or
by direct calculation from the Bayesian network depicted in Figure 8.3. We
now require $p'(T_P) = p(T_P)$, i.e.,


\[ t_P = \tilde{t}_P \tag{8.16} \]


and replace $t_P$ in Equation (8.11) by the expression for $\tilde{t}_P$
given in Equation (8.15).
With this we calculate the difference,


\[ \Delta_0 := p'(T_F, T_P) - p(T_F, T_P) \tag{8.17} \]

and obtain:

\[ \Delta_0 = (p_F^* - q_F^*)\,(p_P^* - q_P^*)\, t_F\, \bar{t}_F \tag{8.18} \]


Hence,


Theorem 8.3 $\Delta_0 = 0$ iff $(p_F^* = q_F^*)$ or $(p_P^* = q_P^*)$. And
$\Delta_0 > 0$ if $(p_F^* > q_F^*)$ and $(p_P^* > q_P^*)$.

The first part of the theorem says that if either $T_F$ and $T_F^*$ are
independent or if $T_P^*$ and $T_P$ are independent, then $T_F$ and $T_P$
remain independent after the reduction. The second part of the theorem
says that the conjunction of $T_F$ and $T_P$ is more likely after the
reduction if $T_F$ confirms $T_F^*$ and if $T_P^*$ confirms $T_P$.
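
As a sanity check on Equation (8.18), the following short sketch computes the prior of the conjunction before and after the reduction, using the matching condition (8.16), and compares the difference with the closed form. The numerical values and variable names are illustrative assumptions of ours.

```python
# Illustrative values (assumptions chosen for this sketch, not from the text).
t_F = 0.4
pFs, qFs = 0.9, 0.2     # p'(T_F* | T_F), p'(T_F* | not T_F)
pPs, qPs = 0.85, 0.1    # p'(T_P | T_P*), p'(T_P | not T_P*)

# p'(T_P | T_F) and p'(T_P | not T_F), obtained by summing over T_F* = T_P*.
p_tp_given_tf = pFs * pPs + (1 - pFs) * qPs
p_tp_given_not_tf = qFs * pPs + (1 - qFs) * qPs

# Matching condition (8.16): the prior of T_P before the reduction is set
# equal to its marginal after the reduction, Equation (8.15).
t_P = t_F * p_tp_given_tf + (1 - t_F) * p_tp_given_not_tf

prior_before = t_F * t_P              # Equation (8.11)
prior_after = t_F * p_tp_given_tf     # Equation (8.12)
delta_0 = prior_after - prior_before  # Equation (8.17)

# Closed form of Equation (8.18).
closed_form = (pFs - qFs) * (pPs - qPs) * t_F * (1 - t_F)

print(delta_0, closed_form)
```

With these values the conjunction becomes more probable after the reduction, as the second part of Theorem 8.3 predicts for $p_F^* > q_F^*$ and $p_P^* > q_P^*$.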

Next, let us compare the posterior probabilities of the conjunction of
$T_F$ and $T_P$ before and after the reduction. To do so, we calculate the
difference,


\[ \Delta_1 := p'(T_F, T_P|E, E_F, E_P) - p(T_F, T_P|E, E_F, E_P) \tag{8.19} \]


and obtain:

\[ \Delta_1 = (p_F^* - q_F^*)\,(p_P^* - q_P^*)\, t_F\, \bar{t}_F\, \alpha\, \tilde{\Delta}_1 \tag{8.20} \]

The explicit expression for $\tilde{\Delta}_1$ is given in the proofs at
the end of this variation. Equation (8.20) then entails the following
theorem:


Theorem 8.4 $\Delta_1 = 0$ if $(p_F^* = q_F^*)$ or $(p_P^* = q_P^*)$. That
is, the posterior probability of $T_F \wedge T_P$ after the reduction
equals its posterior probability before the reduction if one of the two
equalities above is satisfied.


This result has an intuitive interpretation: If either $T_F$ and $T_F^*$
or $T_P^*$ and $T_P$ are independent, then the flow of confirmation from
$T_F$ to $T_P$ (and vice versa) is stopped and the epistemic situation
before and after the reduction is the same.

Using the expression for $\tilde{\Delta}_1$, we obtain:


Theorem 8.5 $\Delta_1 > 0$ if the following three conditions hold: (i)
$\beta, \gamma > \delta$, (ii) $0 < x_F, x_P < 1$, where $x_F := q_F/p_F$
and $x_P := q_P/p_P$, and (iii) $(p_F^* - q_F^*)\,(p_P^* - q_P^*) > 0$.
That is, the posterior probability of $T_F \wedge T_P$ after the reduction
exceeds its posterior probability before the reduction if the above
inequalities are satisfied.


Condition (i) seems natural in the light of inequalities (8.2) and (8.3).
In fact, it is a rather weak condition which also holds, for example, for
Set 2, below. Condition (ii) makes sure that $E_F$ confirms $T_F$ and
$E_P$ confirms $T_P$; we have assumed this throughout. Condition (iii) is
our usual condition on the dependency between $T_F$ and $T_F^*$, as well
as between $T_P^*$ and $T_P$. Hence, none of these conditions is in any
way problematic. Given this, we conclude that the posterior probability of
the conjunction of $T_F$ and $T_P$ indeed increases after a reductive
relationship is established between the two theories.
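
The posterior comparison can be illustrated in the same way. The sketch below evaluates the two posteriors from their closed forms (Equations (8.26) and (8.30) in the proofs), takes their difference, and checks it against the factorized expression that underlies Theorem 8.5. As before, the numbers and identifiers are illustrative assumptions; they satisfy conditions (i) to (iii) of the theorem.

```python
# Illustrative values (assumptions chosen for this sketch, not from the text).
t_F = 0.4
p_F, q_F = 0.8, 0.3
p_P, q_P = 0.7, 0.2
pFs, qFs = 0.9, 0.2      # p'(T_F* | T_F), p'(T_F* | not T_F)
pPs, qPs = 0.85, 0.1     # p'(T_P | T_P*), p'(T_P | not T_P*)
alpha, beta, gamma, delta = 0.9, 0.5, 0.4, 0.1

x_F, x_P = q_F / p_F, q_P / p_P            # ratios used in the proofs
phi_a = pFs * pPs + (1 - pFs) * qPs        # p'(T_P | T_F)
phi_b = 1 - phi_a                          # p'(not T_P | T_F)
phi_g = qFs * pPs + (1 - qFs) * qPs        # p'(T_P | not T_F)
phi_d = 1 - phi_g                          # p'(not T_P | not T_F)
t_P = t_F * phi_a + (1 - t_F) * phi_g      # matching condition (8.16)

# Denominators of the two posteriors (after dividing by p_F * p_P).
N1 = (t_F * (t_P * alpha + (1 - t_P) * x_P * beta)
      + (1 - t_F) * x_F * (t_P * gamma + (1 - t_P) * x_P * delta))
N2 = (t_F * (alpha * phi_a + x_P * beta * phi_b)
      + (1 - t_F) * x_F * (gamma * phi_g + x_P * delta * phi_d))

post_before = t_F * t_P * alpha / N1       # Equation (8.26)
post_after = t_F * alpha * phi_a / N2      # Equation (8.30)
delta_1 = post_after - post_before         # Equation (8.19)

# Factorized form: a positive core term times the product from (8.20).
core = (t_F * x_F * (phi_a - phi_g) * (gamma - delta * x_P)
        + t_F * x_P * (beta - delta * x_F)
        + gamma * phi_g * x_F + delta * phi_d * x_F * x_P)
factored = (phi_a - phi_g) * t_F * (1 - t_F) * alpha * core / (N1 * N2)

print(delta_1, factored)
```

Setting `pFs = qFs` makes `phi_a` equal to `phi_g`, so that the difference vanishes, in line with Theorem 8.4.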


Discussion 


We have discussed how the Generalized Nagel-Schaffner model of reduc- 
tion impacts on the confirmation of theories by evidence. We formulated 
criteria that help us assess proposed reductions epistemically, and we have 
shown how a reduction facilitates the flow of confirmation from the reduc- 
ing theory to the reduced theory and back. 

A GNS reduction between two theories, such as thermodynamics and 
statistical mechanics, is epistemically advantageous in virtue of our main 
results: Theorem 8.1 and Theorem 8.2. Specifically, we have shown that a 
reduction ensures that evidence which, prior to reduction, only supported 
one of the theories, comes to support the other theory as well, due to the 
reduction. Moreover, a successful reduction increases both the prior and 
the posterior probability of the conjunction of both theories (Theorems 8.3 and 8.5).

Our Bayesian account also shows to what extent the various judgments 
depend on the probabilistic judgments of the scientists, connecting (or
so we argue) our account to scientific practice. Disagreement about the
epistemic value of a reduction can be traced back to disagreement about
the assignment of the relevant prior and conditional probabilities. This
need not be a disagreement about exact numbers and may also take the 
form of qualitative (e.g., ordinal) plausibility judgments. 

As usual, we finish the variation with a series of proposals for follow- 
up projects. 




First, one might propose to accept a proposed reduction if the
conjunction of $T_F$ and $T_P$ is better confirmed by the evidence after
the reduction, compared to the situation before the reduction. Determining
whether this is the case requires an analysis in terms of degree of
confirmation. That is, one has to choose one of the various confirmation
measures (→ Variation 2). Dizadji-Bahmani, Frigg and Hartmann conduct such an analysis for
the difference measure d(H, E) and come to the conclusion that the degree 
of confirmation is usually greater if a reduction has taken place than if not. 
Several other confirmation measures have to be checked and the stability 
of these results has to be explored. 

The previous observation suggests that strong coherence between the 
fundamental and the phenomenological theory may be confirmation- 
conducive (Dietrich and Moretti, 2005; Moretti, 2007). So second, it would 
be interesting to compare degrees of coherence before and after an in- 
tertheoretic reduction has taken place (Bovens and Hartmann, 2003). Here, 
one might want to focus on the two theories in question, or on the con- 
junction of the theories and all available evidence. It might be reasonable 
to focus on the latter, as the evidence is also uncertain and one might, in 
the end, be interested in the coherence of the entire package, comprising 
all available theories and all available evidence. Should coherence
considerations play a role when it comes to deciding whether a theory
should be accepted?

Third, one may want to examine the situation where evidence for, say, 
the fundamental theory disconfirms the phenomenological theory. How 
shall one assess the value of a reduction in these situations? 

Fourth and finally, other types of intertheoretic relation should be stud- 
ied from a Bayesian point of view. Here, we are thinking of “stories” (Hart- 
mann, 1999) and singular limits (Batterman, 2002). But there will surely be 
other examples. This project requires collaboration between philosophers
of science, who conduct case studies, and formal philosophers, who
provide the corresponding Bayesian analysis. It may also be asked what
picture of the structure of science as a whole emerges from all this. It
seems plausible to find something like a network structure, with more or
less connected theories and models, and it might be interesting to discuss
the implications of this for the debate about the (dis-)unity of science.


Proofs of the Theorems


Let us start with the situation before the reduction and the Bayesian
network represented in Figure 8.2. The joint distribution
$p(T_F, T_P, E, E_F, E_P)$ is given by the expression


\[ p(T_F)\, p(T_P)\, p(E|T_F, T_P)\, p(E_F|T_F)\, p(E_P|T_P) . \]


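Using the methodology described in Bovens and Hartmann (2003, Ch. 3),
we obtain:


\[ p(T_F, E) = \sum_{T_P, E_F, E_P} p(T_F, T_P, E, E_F, E_P)
   = t_F\, (t_P\, \alpha + \bar{t}_P\, \beta) \tag{8.21} \]


Similarly, we calculate


\begin{align}
p(T_P, E) &= t_P\, (t_F\, \alpha + \bar{t}_F\, \gamma) \tag{8.22} \\
p(E) &= t_F\, (t_P\, \alpha + \bar{t}_P\, \beta) + \bar{t}_F\, (t_P\, \gamma + \bar{t}_P\, \delta) \tag{8.23} \\
     &= t_P\, (t_F\, \alpha + \bar{t}_F\, \gamma) + \bar{t}_P\, (t_F\, \beta + \bar{t}_F\, \delta) \tag{8.24}
\end{align}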


To prove Equation (8.2) we note, using the definition of conditional
probability, that $p(T_P|E) > p(T_P)$ iff $p(T_P, E) - p(T_P)\, p(E) > 0$
and obtain, using Equations (8.22) and (8.24),


\[ p(T_P, E) - p(T_P)\, p(E) = t_P\, \bar{t}_P\, [(\alpha - \beta)\, t_F + (\gamma - \delta)\, \bar{t}_F] , \tag{8.25} \]


from which Equation (8.2) immediately follows. The proof of Equation
(8.3) proceeds accordingly using Equations (8.21) and (8.23).
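Next, we calculate the prior probability of the two theories.


\[ p(T_F, T_P) = \sum_{E, E_F, E_P} p(T_F, T_P, E, E_F, E_P)
   = p(T_F)\, p(T_P) = t_F\, t_P \]


Similarly, we obtain for the posterior probability
$P_1^* := p(T_F, T_P|E, E_F, E_P)$:


\begin{align}
P_1^* &= \frac{p(T_F, T_P, E, E_F, E_P)}{p(E, E_F, E_P)} \notag \\
      &= \frac{t_F\, t_P\, p_F\, p_P\, \alpha}
              {t_F t_P p_F p_P \alpha + t_F \bar{t}_P p_F q_P \beta
               + \bar{t}_F t_P q_F p_P \gamma + \bar{t}_F \bar{t}_P q_F q_P \delta} \notag \\
      &= \frac{t_F\, t_P\, \alpha}
              {t_F\, (t_P\, \alpha + \bar{t}_P\, x_P\, \beta)
               + \bar{t}_F\, x_F\, (t_P\, \gamma + \bar{t}_P\, x_P\, \delta)} \tag{8.26}
\end{align}

with the probability ratios $x_F := q_F/p_F$ and $x_P := q_P/p_P$.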


Let us now turn to the situation after the reduction and the Bayesian
network represented in Figure 8.3. The joint distribution
$p'(T_F, T_P, T_F^*, T_P^*, E, E_F, E_P)$ is given by


\[ p'(T_F)\, p'(E|T_F, T_P)\, p'(E_F|T_F)\, p'(E_P|T_P)\, p'(T_P|T_P^*)\, p'(T_P^*|T_F^*)\, p'(T_F^*|T_F) . \]


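To simplify our notation, we introduce the following abbreviations:


\begin{align*}
\varphi_\alpha &:= p_F^*\, p_P^* + \bar{p}_F^*\, q_P^* , &
\varphi_\beta &:= p_F^*\, \bar{p}_P^* + \bar{p}_F^*\, \bar{q}_P^* \\
\varphi_\gamma &:= q_F^*\, p_P^* + \bar{q}_F^*\, q_P^* , &
\varphi_\delta &:= q_F^*\, \bar{p}_P^* + \bar{q}_F^*\, \bar{q}_P^*
\end{align*}

For later use, we note that
$0 < \varphi_\alpha, \varphi_\beta, \varphi_\gamma, \varphi_\delta < 1$ and

\begin{align}
\varphi_\alpha - \varphi_\gamma &= \varphi_\delta - \varphi_\beta = (p_F^* - q_F^*)\,(p_P^* - q_P^*) \tag{8.27} \\
\varphi_\alpha + \varphi_\beta &= \varphi_\gamma + \varphi_\delta = 1 \tag{8.28}
\end{align}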


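We then obtain for the prior probability of the conjunction of both
theories after the reduction

\[ p'(T_F, T_P) = t_F\, \varphi_\alpha . \tag{8.29} \]

For the posterior $P_2^* := p'(T_F, T_P|E, E_F, E_P)$, we obtain:


\[ P_2^* = \frac{t_F\, \alpha\, \varphi_\alpha}
                {t_F\, (\alpha\, \varphi_\alpha + x_P\, \beta\, \varphi_\beta)
                 + \bar{t}_F\, x_F\, (\gamma\, \varphi_\gamma + x_P\, \delta\, \varphi_\delta)} \tag{8.30} \]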


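Similarly, we calculate


\begin{align}
p'(T_P) &= t_F\, \varphi_\alpha + \bar{t}_F\, \varphi_\gamma \tag{8.31} \\
p'(T_P|E_F) &= \frac{t_F\, \varphi_\alpha + \bar{t}_F\, x_F\, \varphi_\gamma}{t_F + \bar{t}_F\, x_F} \tag{8.32} \\
p'(T_F|E_P) &= \frac{t_F\, (\varphi_\alpha + x_P\, \varphi_\beta)}
                    {t_F\, (\varphi_\alpha + x_P\, \varphi_\beta) + \bar{t}_F\, (\varphi_\gamma + x_P\, \varphi_\delta)} \tag{8.33}
\end{align}


We now calculate


\begin{align*}
p'(T_P|E_F) - p'(T_P)
  &= \frac{t_F\, \bar{t}_F\, (\varphi_\alpha - \varphi_\gamma)\,(1 - x_F)}{t_F + \bar{t}_F\, x_F} \\
  &= \frac{t_F\, \bar{t}_F\, (p_F - q_F)\,(p_F^* - q_F^*)\,(p_P^* - q_P^*)}{p_F\, (t_F + \bar{t}_F\, x_F)} .
\end{align*}


This proves Theorem 8.1. Similarly, we calculate


\begin{align*}
p'(T_F|E_P) - p'(T_F)
  &= \frac{t_F\, \bar{t}_F\, (\varphi_\alpha - \varphi_\gamma)\,(1 - x_P)}
          {t_F\, (\varphi_\alpha + x_P\, \varphi_\beta) + \bar{t}_F\, (\varphi_\gamma + x_P\, \varphi_\delta)} \\
  &= \frac{t_F\, \bar{t}_F\, (p_P - q_P)\,(p_F^* - q_F^*)\,(p_P^* - q_P^*)}
          {p_P\, [t_F\, (\varphi_\alpha + x_P\, \varphi_\beta) + \bar{t}_F\, (\varphi_\gamma + x_P\, \varphi_\delta)]} ,
\end{align*}


which proves Theorem 8.2.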
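To prove Equation (8.18), we note that, using Equation (8.29),


\[ \Delta_0 = (\varphi_\alpha - t_P)\, t_F . \]

We now use Equations (8.16) and (8.31) and obtain


\begin{align*}
\Delta_0 &= (\varphi_\alpha - t_F\, \varphi_\alpha - \bar{t}_F\, \varphi_\gamma)\, t_F \\
         &= (\varphi_\alpha - \varphi_\gamma)\, t_F\, \bar{t}_F .
\end{align*}


Equation (8.18) then follows using Equation (8.27).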


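Let us finally calculate $\Delta_1$ using Equations (8.26) and (8.30). We
obtain


\[ \Delta_1 = (\varphi_\alpha - \varphi_\gamma)\, t_F\, \bar{t}_F\, \alpha\, \tilde{\Delta}_1 , \]


with

\[ \tilde{\Delta}_1 = N_1^{-1}\, N_2^{-1} \cdot \Delta_1' \]


and


\begin{align}
N_1 &= t_F\, (t_P\, \alpha + \bar{t}_P\, x_P\, \beta) + \bar{t}_F\, x_F\, (t_P\, \gamma + \bar{t}_P\, x_P\, \delta) \tag{8.34} \\
N_2 &= t_F\, (\alpha\, \varphi_\alpha + x_P\, \beta\, \varphi_\beta) + \bar{t}_F\, x_F\, (\gamma\, \varphi_\gamma + x_P\, \delta\, \varphi_\delta) . \tag{8.35}
\end{align}


Note that $N_1, N_2 > 0$. We are therefore most interested in $\Delta_1'$,
which is given by


\begin{align*}
\Delta_1' &= t_F\, x_F\, (\varphi_\alpha - \varphi_\gamma)\,(\gamma - \delta\, x_P) + t_F\, x_P\, (\beta - \delta\, x_F) \\
          &\quad + \gamma\, \varphi_\gamma\, x_F + \delta\, \varphi_\delta\, x_F\, x_P .
\end{align*}


From conditions (i) and (ii) of Theorem 8.5, we conclude that
$\gamma > \delta\, x_P$ and $\beta > \delta\, x_F$. Since condition (iii)
together with Equation (8.27) gives $\varphi_\alpha - \varphi_\gamma > 0$,
every term of $\Delta_1'$ is non-negative and the second is strictly
positive. Hence $\Delta_1' > 0$, which proves the theorem.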


Variation 9: Hypothesis Testing and 
Corroboration 


Scientific reasoning often proceeds by testing hypotheses and appraising 
how well they have stood up to the test. For critical rationalists such as 
Karl R. Popper (2002), the critical attitude that we express by repeatedly 
testing our best scientific theories even constitutes the basis of rational
inquiry about the world. Arguably, such tests have already been conducted
in antiquity: think of Eratosthenes’ test of the hypothesis that the Earth
is round, conducted by comparing the height of the sun in two different
places at the same time. However, only in the middle of the 20th century
were the design and interpretation of hypothesis tests formalized and
standardized. The emergence of the discipline of statistics, the science of
analyzing and interpreting data, played a crucial role in this process. It
provided science with probabilistic tests, above all null hypothesis
significance tests (NHST), which have acquired a predominant role in
scientific reasoning.


NHST test a precise hypothesis $H_0$ (the “null” or default hypothesis)
against an unspecific alternative $H_1$. In the most common form of NHST,
the null hypothesis posits a precise value for a real-valued parameter
$\theta$ ($H_0: \theta = \theta_0$), while the alternative
($H_1: \theta \neq \theta_0$) is a disjunction of infinitely many precise
hypotheses (e.g., Neyman and Pearson, 1933, 1967;
Fisher, 1956). The null hypothesis standardly denotes an absent or negligi- 
ble effect (e.g., a new medical drug is not better than a placebo treatment) 
whereas the alternative stands for a sizeable effect. NHST are applied 
across all domains of science, but they are especially prominent in psy- 
chology and medicine. 


Despite their popularity in scientific inference, the philosophical
foundations of NHST are shaky at best. NHST are used for quantifying
evidence that the data accumulate against the null hypothesis. When this
level of evidence is high enough, i.e., greater than a prespecified
significance threshold, the null hypothesis is rejected. See Figure 9.1 for
an illustration. The more the observed value lies in the tail of the
distribution, the more it counts as evidence against the null hypothesis
(e.g., Fisher, 1956; Mayo, 1996). Mathematically, the significance level is
captured by the notorious p-value: for a random variable $X$ with
realization $x$ and a function $z(X)$ that measures the distance to the
null hypothesis (e.g., the difference between the null mean and the actual
sample mean), the p-value describes the probability of obtaining a result
that speaks at least as much against the null hypothesis as the actual
result:


\[ p := p_{H_0}\big(|z(X)| \geq |z(x)|\big) \tag{9.1} \]
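
For the special case in which $z(X)$ follows a standard Normal distribution under the null hypothesis, the p-value of Equation (9.1) can be computed with the complementary error function. The sketch below makes exactly this distributional assumption; the function name and the two example values of the test statistic are ours.

```python
import math


def two_sided_p_value(z_obs):
    """p-value p_{H0}(|z(X)| >= |z_obs|), assuming z(X) is standard Normal under H0."""
    # For a standard Normal Z, P(|Z| >= a) = erfc(a / sqrt(2)) for a >= 0.
    return math.erfc(abs(z_obs) / math.sqrt(2.0))


p_at_196 = two_sided_p_value(1.96)   # close to the conventional 0.05 threshold
p_at_1 = two_sided_p_value(1.0)      # far from significance

print(p_at_196, p_at_1)
```

The p-value depends only on the absolute value of the observed statistic, which is why the shaded rejection region in a two-sided test covers both tails of the density.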


[Plot: standard Normal density; the size of the shaded area equals the p-value.]


Figure 9.1: The shaded area indicates the set of observations where the
null hypothesis $H_0$ is “rejected”. Here, $H_0$ denotes the hypothesis
that the observations follow a standard Normal distribution with mean
value $\theta = 0$ as opposed to $\theta \neq 0$.


However, there is barely any methodological guidance on how we 
should interpret a non-significant result, that is, a result where we fail 
to reject the null hypothesis. Statistics textbooks (e.g., Chase and Brown, 
2000; Wasserman, 2004) restrict themselves to a purely negative interpre- 
tation: failure to reject the null means failure to demonstrate a statistically 
significant phenomenon.

To illustrate this point, consider a Binomial model where we are testing
the hypothesis that the coin is fair. To what extent does a result of 52
heads in 100 independent tosses corroborate the null hypothesis, compared
with, say, a result of 58 heads? Neither x = 52 nor x = 58 qualifies as
significant evidence against the null hypothesis at the p = 0.05 level,
but there is certainly a difference in the performance of the null
hypothesis on that particular dataset. The classical statistical
methodology, which refuses to interpret p-values greater than 0.05, fails
to quantify this difference. For example, p-values of .15 and .35 have, for
all practical purposes, the same meaning: they are above the range where
results are statistically significant, and therefore no evidence against the
null.
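
The coin example can be made concrete by computing exact two-sided p-values under the fair-coin hypothesis. The sketch below is one way to do this, assuming that "at least as extreme" is measured by the absolute distance of the number of heads from its null expectation; the function name is ours.

```python
from math import comb


def two_sided_binomial_p(x, n=100, theta0=0.5):
    """Exact p-value: null probability of outcomes at least as far from n*theta0 as x."""
    center = n * theta0
    distance = abs(x - center)
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(n + 1)
               if abs(k - center) >= distance)


p_52 = two_sided_binomial_p(52)
p_58 = two_sided_binomial_p(58)

print(p_52, p_58)
```

Both values lie well above 0.05, yet they differ considerably (roughly 0.76 against 0.13), which is exactly the difference that a purely negative reading of non-significant results leaves uninterpreted.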

All in all, the standard NHST method does not address the question 
whether the results corroborate the null hypothesis. Should we prefer 
the null hypothesis to the alternative hypotheses and preliminarily accept 
it? Whenever the null hypothesis is of substantial scientific interest, e.g., 
independence of two variables in a causal model, the safety of a medical 
drug or the adequacy of a phylogenetic tree, such judgments are urgently 
required. This fact is also acknowledged by numerous scientists. For two 
recent examples from psychology, see Gallistel (2009) and Morey et al. 
(2014). 

Explicating (degree of) corroboration is thus central for a sound inter- 
pretation of NHST. Karl R. Popper, one of the few philosophers engaging 
in this debate, proposed the following characterization: 


By the degree of corroboration of a theory I mean a concise 
report evaluating the state (at a certain time $t$) of the critical
discussion of a theory, with respect to the way it solves its 
problems; its degree of testability; the severity of tests it has 
undergone; and the way it has stood up to these tests. Cor- 
roboration (or degree of corroboration) is thus an evaluating 
report of past performance. Like preference, it is essentially 
comparative. (Popper, 1979, 18) 


In Popper’s view, corroboration judgments positively appraise the per- 
formance of the null hypothesis in a severe test, rather than just stating 
the failure to find significant evidence against it. Notably, high degrees of 
corroboration need not guide us to the truth (Popper, 1979, 21). Instead, 
the function of corroboration is comparative and pragmatic: it guides our 
practical preferences over competing hypotheses, for example the choice 
of the hypothesis on which we base the next experiment (Popper, 2002, 
416). This is exactly what most scientists are after when testing a complex 
set of hypotheses. 


The corroboration-based approach to scientific reasoning should not 
be confused with confirmation-based reasoning. While (Bayesian) confir- 
mation is based on increase in degree of belief, corroboration does not 
imply any confidence in the tested hypothesis: it is just the statement that 
a hypothesis has survived severe tests. Popper (2002, ch. 8 and 10, 
appendix vii) even argued for the impossibility of inductive (Bayesian) 
probability while defending the epistemic role of corroboration. According 
to Popper’s corroboration-centered perspective, scientific progress occurs 
through successive elimination of hypotheses, and degrees of corrobora- 
tion guide practical preferences over the competing hypotheses. This is 
something very different from a probabilistic inductive logic in the style 
of Carnap (1950). 


This variation, which partly builds on results from Sprenger (2016d), 
explores the prospects for a corroboration-based epistemology of NHST. 
We begin with a conceptual demarcation of degree of corroboration ver- 
sus degree of confirmation (Section 9.1). Then, we discuss Popper’s own 
explication of corroboration (Section 9.2) and address the more general 
question of whether testability and past performance may be synthesized 
into a single measure of corroboration (Section 9.3). The answer is neg- 
ative: no such measure can simultaneously satisfy a set of desirable con- 
straints. This seems to create insurmountable problems for the project of 
explicating corroboration, but they can be solved by moving to a differ- 
ent statistical framework. We construct a measure of corroboration that 
fruitfully applies Popperian thinking to hypothesis tests and that can be 
understood as a generalization of Bayesian inference (Section 9.4). Finally, 
we compare this measure to p-values in NHST and standard Bayesian in- 
ference (Section 9.5) and we provide the proofs of our results (Section 9.6). 
While the practical merits of the new corroboration measure are still to be 
evaluated, it demonstrates two important theoretical insights: First, we can 
provide a valid interpretation of non-significant results in NHST. Second, 
Popperian and Bayesian approaches to hypothesis testing may, in the end, 
be less fierce opponents in hypothesis testing than the popular picture has 
it. 
Confirmation versus Corroboration 


The point of measuring corroboration is to quantify the extent to which a 
hypothesis has stood up to an attempt to refute it. Thus, degree of cor- 
roboration gives an evaluating report of past performance. For the case of 
a hypothesis that makes deterministic predictions, corroborating evidence 
is intuitively defined as evidence that conforms to the predictions of the 
tested hypothesis. The more specific the evidence, the more it corroborates 
the hypothesis. 

This rationale essentially corresponds to the hypothetico-deductive 
model of theory confirmation (e.g., Gemes, 1998): observed logical con- 
sequences of a theory confirm it. While this model may be adequate as a 
qualitative theory of corroboration, it is not applicable to NHST. Here, a 
different, quantitative model has to be developed that applies to statistical 
inference (see also Popper, 2002, 265-266). 

However, do we really need the concept of corroboration to explicate 
this aspect of NHST? Can’t we just describe the results of NHST in terms 
of degree of confirmation? According to Bayesian Confirmation Theory, 
evidence E confirms hypothesis H if and only if p(H|E) > p(H), where 
p represents an agent’s subjective degrees of belief. That is, E confirms H 
if and only if E increases the agent’s subjective degree of belief in H (e.g., 
Fitelson, 2001b, see also Variation 2 of this book). Before introducing a 
new and complex concept—corroboration—we first need to argue why it 
is not coextensive with confirmation as increase in firmness. 

In other words, we have to address the Monism Thesis: the epistemic 
function of the concept of corroboration can be taken over by the Bayesian 
concept of confirmation as increase in firmness. The monist replaces a 
judgment of corroboration by a judgment of confirmation. This line of 
argument gains support from authors such as Howson and Urbach (2006) 
and Wagenmakers et al. (2011), who argue that NHST should be aban- 
doned and be replaced by Bayesian hypothesis testing. 

We shall now present three objections to the Monism Thesis. This does 
not rule out that explications of corroboration and confirmation agree nu- 
merically: rather, the point is to show that the two concepts are not redun- 
dant and need different explication strategies. 


Objection 1: Corroboration does not aim at inferring probable 
hypotheses, or raising our degree of belief in the tested hypothesis. 


This objection contends that scientific hypotheses and models are ide- 
alizations of the external world, which are judged by their ability to cap- 
ture relevant causal relations and to predict future events (see the survey 
of Frigg and Hartmann, 2012). The epistemic function of corroboration 
consists in determining whether the data are consistent with the tested 
hypothesis, or whether the results agree “well enough” with the null 
hypothesis $H_0$ that we may use it as a proxy for a more general statistical 
model. 

Consequently, corroborated hypotheses should not be regarded as true 
or empirically adequate, but as useful and tractable idealizations of a 
general statistical model (Bernardo, 2012; Gelman and Shalizi, 2012, 2013). 
Corroboration is a guide to practical preference over competing 
hypotheses, but it does not ground confidence in the truth of the tested 
hypothesis (Popper, 2002, 281-282). 

Degree of confirmation, on the other hand, is defined by the change of 
confidence in a hypothesis. Characteristically, all confirmation measures 
c(H,E) possess the Final Incrementality Property familiar from Variation 
2: 


\[ c(H,E) > c(H,E') \quad \text{if and only if} \quad p(H|E) > p(H|E') \tag{9.2} \]


This condition demands that E confirms H more than E’ if and only if 
E raises the probability of H to a higher level than E’ does (Festa, 2012; 
Crupi, 2013). However, corroboration is about past performance, not about 
epistemic or psychological attitude. In a nutshell, rather than a (subjective) 
measure of belief change, corroboration ought to be an (objective) measure 
of past performance. Indeed, even if we did not have subjective degrees 
of belief in the tested hypothesis or were unable to elicit them, we should 
still be able to assess the past performance of the null hypothesis by a 
judgment of corroboration. 


Objection 2: On a Bayesian account, hypotheses with prior 
probability p(H) = 0 cannot be confirmed. Yet, they are per- 
fectly acceptable candidates for being corroborated. 


This point was first raised by Karl R. Popper (2002, appendix vii). As 
a consequence of Bayes’ Theorem, any hypothesis with prior probability 
p(H) = 0 also has posterior probability p(H|E) = p(H) p(E|H)/p(E) = 0. 
No such hypothesis can be confirmed in the sense of increase in firmness. 
But certainly, they can be corroborated: after all, scientists often deal with an 
uncountable set of candidate hypotheses where all singleton hypotheses 
receive zero weight (e.g., different values of a physical parameter). Testing 
whether such hypotheses are good idealizations of reality certainly makes 
sense. 

This argument is in line with the practice of Bayesian statistics. 
Bayesian hypothesis tests often assign zero weight to the null hypothesis 
$H_0: \theta = \theta_0$, e.g., by assigning a continuous prior over the entire 
parameter space. Whatever the measure of evidence that the Bayesian uses for 
appraising the null in such tests (e.g., a density-based measure such as the 
Bayes factor), it cannot be a classical Bayesian confirmation measure that 
compares p(H|E) to p(H). The Bayesian apparatus may be a convenient 
mathematical tool for performing such hypothesis tests, but it stands in 
need of a philosophical rationale regarding the outcomes of the analysis. 
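
A minimal sketch of such a density-based appraisal, applied to the coin-toss data: it computes the Bayes factor of the point null θ = 1/2 against an alternative with a uniform prior on θ. The uniform prior, the choice θ = 1/2 and the function name are illustrative assumptions on our part; the text does not commit to any particular prior.

```python
from math import comb

def bayes_factor_null(x: int, n: int = 100, theta0: float = 0.5) -> float:
    """Bayes factor of the point null theta = theta0 against a uniform prior on theta.

    Assumption: the alternative spreads its prior mass uniformly over [0, 1],
    so the point null has prior probability zero under that continuous prior.
    """
    likelihood_null = comb(n, x) * theta0 ** x * (1 - theta0) ** (n - x)
    # Under a uniform prior, the marginal likelihood of x successes in n trials
    # is the integral of comb(n, x) * t**x * (1 - t)**(n - x) dt = 1 / (n + 1).
    marginal_alternative = 1 / (n + 1)
    return likelihood_null / marginal_alternative

for x in (52, 58):
    print(f"x = {x}: Bayes factor in favor of the null = {bayes_factor_null(x):.2f}")
```

The ratio compares likelihoods and does not compare p(H|E) to p(H). This is why it remains well defined even though the null receives zero prior weight under the continuous prior.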


Objection 3: Corroboration is a far more asymmetric notion 
than confirmation. 


The logic of NHST is asymmetric: unlike the null hypothesis, the 
alternative is usually not a precise hypothesis, as in our introductory 
example of testing $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. As explained above, 
NHST often aims at finding out whether the null hypothesis is a good 
proxy for the more general model represented by the alternative. Finding 
the null hypothesis highly corroborated is a precise conclusion in favor 
of the null whereas a “rejection” of the null leaves open which of the 
alternatives is corroborated. Confirmation judgments, however, are 
symmetric: disconfirmation of H is also confirmation of $\neg H$, and sometimes, 
it is also demanded that $c(\neg H,E) = -c(H,E)$ (Crupi et al., 2007). That 
is, while confirmation measures pitch a hypothesis H against its negation 
$\neg H$, measures of corroboration pitch $H_0$ against a set of distinct 
alternatives. Section 9.4 will elaborate on this idea. 

These objections undermine the Monism Thesis sufficiently to moti- 
vate that the concept of corroboration stands in need of an independent 
explication and cannot be reduced to degree of confirmation. We begin by 
discussing Popper’s classical proposal for a measure of corroboration. 
Popper’s Measure of Degree of Corroboration 


Popper’s first writings on degree of corroboration, in Chapter 10 of “Logic 
of Scientific Discovery” (1934/2002), do not engage in a quantitative ex- 
plication. Apparently, this task is deferred to a scientist’s common sense 
(see, e.g., Popper, 2002, 265-267). However, this move makes the entire 
concept of corroboration vulnerable to the charge of subjectivism: without 
a quantitative criterion, it is not clear which corroboration judgments are 
sound and which aren’t (Good, 1968b, 136). Especially if we aim at gaining 
objective knowledge from hypothesis tests, we need a precise explication 
of degree of corroboration. 

Popper faces this challenge in a couple of BJPS articles (Popper, 1954, 
1957, 1958) that form, together with a short introduction, appendix ix of 
“Logic of Scientific Discovery”. In these articles, Popper develops and 
defends a measure of degree of corroboration. Popper argues that this 
measure cannot be a probability in the sense of Carnap (1950), that is, 
the plausibility of the tested theory (or hypothesis) conditional on the 
observed evidence: 


[...] the probability of a statement [...] simply does not 
express an appraisal of the severity of the tests a theory has 
passed, of the manner in which it has passed these tests. (Pop- 
per, 2002, 411) 


In particular, logical content and informativity contribute to the testability 
of a theory and to its degree of corroboration: 


The main reason for this is that the content of a theory—which 
is the same as its improbability—determines its testability and 
corroborability. (ibid., original emphasis) 


Recall that testability, identified with the empirical content or informa- 
tivity of a hypothesis, is an essential cognitive value for Popper: being 
testable is a hallmark of science as opposed to pseudo-scientific theories 
that can be reconciled with all types of empirical evidence. Popper’s clas- 
sical examples are psychoanalysis and Marxist economics. While pseudo- 
scientific theories are a lens to watch the world rather than statements 
about the world (e.g., Marxists interpret all economic developments as 
following the logic of class struggle), genuinely scientific theories make 
testable predictions and may be refuted empirically. 

Testability is also assigned a crucial role in Popper’s characterization 
of corroboration. Corroboration should be sensitive to the 
informativity and logical content of a theory, which is again related to the 
improbability of a theory. If one considers that degree of corroboration should 
guide our judgments of acceptance in NHST, this makes a lot of sense: 
good theories should agree with observed evidence and be informative 
(see the discussions in Hempel, 1960; Levi, 1963; Huber, 2005). Popper 
confirms that scientific theory assessment pursues both goals at once: 


Science does not aim, primarily, at high probabilities. It aims 
at a high informative content, well backed by experience. But 
a hypothesis may be very probable simply because it tells us 
nothing, or little. (Popper, 2002, 416, original emphasis) 


Such a characterization of corroboration is attractive because it amalga- 
mates two crucial cognitive values in theory assessment: high informative 
content and empirical confirmation. Also in NHST, both values play a role 
since a precise hypothesis (the null) is tested against a continuum of alter- 
natives. However, this variation shows that such a tradeoff is unattainable 
if further reasonable assumptions are made. 

Let us now look at how Popper characterizes degree of corroboration. 
Transcribed to modern notation, Popper assumes that evidence E and 
hypothesis H are among the closed sentences $\mathcal{L}$ of a first-order language L. A 
corroboration measure is described by a function $C: \mathcal{L}^2 \times \mathfrak{P} \to \mathbb{R}$, where 
$\mathfrak{P}$ is the set of probability measures on the $\sigma$-algebra generated by $\mathcal{L}$. This 
function assigns a real-valued degree of corroboration C(H, E) to any pair 
of sentences in $\mathcal{L}$, together with a probability measure p. This measure 
may be interpreted as a function of the logical structure of L, but also as 
objective chance or degree of belief—our discussion is independent of this 
point. For the sake of simplicity, we will omit reference to background 
assumptions and assume that they are implicit in the probability function 
p. 

Note that such a probabilistic measure of corroboration does not cap- 
ture all aspects of corroboration. Popper (2002, 265-266, 402, 437) and also 
his modern followers (e.g., Rowbottom, 2008, 2011) emphasize that cor- 
roborating evidence has to report the results of sincere and severe effort 
to overturn the tested hypothesis. Obviously, such requirements cannot 
be formalized completely (see also Popper, 1983, 154). We cannot infer 
reversely from a high (probabilistic) degree of corroboration to a sound 
experimental design. The point of a probabilistic measure is rather to de- 
scribe the degree of corroboration of a hypothesis if all important method- 
ological requirements are met. 

Popper then specifies a set of adequacy criteria I-IX for degree of 
corroboration as a function of empirical performance. 

I C(H,E) >/=/< 0 if and only if p(E|H) >/=/< p(E). 


This is a classical statistical relevance condition: E corroborates H just in 
case supposing H makes E more expected. This condition is also in line 
with Popper’s remark that corroboration is, like preference, essentially 
contrastive (Popper, 1979, 18). 


II $-1 = C(\neg H,H) \leq C(H,E) \leq C(H,H) \leq 1$. 

III $C(H,H) = 1 - p(H)$. 

IV If $E \models H$ then $C(H,E) = 1 - p(H)$. 

V If $E \models \neg H$ then $C(H,E) = -1$. 


These conditions determine under which conditions the measure of cor- 
roboration takes its extremal values. Minimal degree of corroboration is 
obtained if the evidence refutes the hypothesis (V). Conversely, the most 
corroborating piece of evidence E is a verification of H (II). In that case, 
degree of corroboration is equal to $1 - p(H)$ (III, IV), which expresses the 
informativity, testability and logical content of H. This is especially plausi- 
ble in Carnap’s logical interpretation of probability, which Popper adopts 
for p(H). But it also makes sense for a subjective Bayesian interpretation. 
See Popper (2002, 268-269), Popper (1963, 385-387), Rowbottom (2012, 
741-744). 


VI C(H,E) > 0 increases with the power of H to explain E. 


VII If p(H) = p(H’), then C(H,E) > C(H’,E’) if and only if p(H|E) > 
p(H’|E’). 


These conditions reiterate the statistical relevance rationale from con- 
dition I, and make it more precise. Regarding condition VI, Popper 
(2002, 416) defines explanatory power according to the formula 
$\mathcal{E}(E,H) = (p(E|H) - p(E))/(p(E|H) + p(E))$, another measure of the statistical 
relevance between E and H. But the details need not bother us here. 
Condition VII states that corroboration essentially co-varies with posterior 
probability whenever two hypotheses are equiprobable at first. In that case, 
posterior probability is a good indicator of past performance. In compar- 
ison to Popper’s original formulation, we have dropped the requirement 
p(H) > 0 because by Bayes’ Theorem, the case p(H) = p(H’) = 0 would 
imply p(H|E) = p(H’|E’) = 0 and trivialize the condition. 


VIII If $H \models E$, then 

a) $C(H,E) \geq 0$; 
b) C(H,E) is an increasing function of $1 - p(E)$; 
c) C(H,E) is an increasing function of p(H). 

IX If $\neg H$ is consistent and $\neg H \models E$, then 

a) $C(H,E) \leq 0$; 
b) C(H,E) is an increasing function of p(E); 
c) C(H,E) is an increasing function of p(H). 


Condition VIII demands that corroboration gained from a successful 
deductive prediction co-vary with the informativity of the evidence and the 
prior probability of the hypothesis. Condition IX mirrors this requirement 
for the case $\neg H \models E$. These conditions can be motivated from the idea 
that if $H \models E$, then corroboration should not automatically transfer to 
hypotheses $H \wedge H'$ that contain an “irrelevant conjunct” $H'$ which has not yet 
been tested. See the next section for more detailed discussion of this point. 

Popper (1954, 359) then proposes the corroboration measure Cp(H, E) 
which satisfies all of his constraints: 


\[ C_P(H,E) = \frac{p(E|H) - p(E)}{p(E|H) + p(E) - p(E|H)\,p(H)} \tag{9.3} \]

But we can easily see that an essential motivation behind a measure of 
degree of corroboration is not satisfied. Cp(H,E) is an increasing function 
of p(H) for all values of p(E|H) and p(E). Hence, the informativity and 
testability of the hypothesis, as measured by 1 — p(H), never contributes 
to its degree of corroboration. This violates Popper’s informal charac- 
terization of the concept and does not square well with the practice of 
NHST. Diez (2011) provides even more reasons why Popper’s explication 
is at odds with the tenets of critical rationalism. We shall now phrase this 
problem more generally and show that it does not only arise for Popper’s 
measure Cp(H,E), but for all corroboration measures that are motivated 
from the same intuitions; that is, measures that aim at capturing statistical 
relevance and testability at the same time. 
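
The behavior of Popper's measure can be checked numerically. The sketch below assumes the form C_P(H,E) = (p(E|H) - p(E)) / (p(E|H) + p(E) - p(E|H) p(H)); the function name and all probability values are hypothetical and serve only as an illustration.

```python
def popper_corroboration(p_e_given_h: float, p_e: float, p_h: float) -> float:
    """Popper's measure C_P(H, E), written in terms of p(E|H), p(E) and p(H).

    Assumption: C_P = (p(E|H) - p(E)) / (p(E|H) + p(E) - p(E|H) * p(H)).
    """
    numerator = p_e_given_h - p_e
    denominator = p_e_given_h + p_e - p_e_given_h * p_h
    return numerator / denominator

# Illustrative numbers: H raises the probability of E from 0.4 to 0.8.
values = [popper_corroboration(0.8, 0.4, p_h) for p_h in (0.1, 0.3, 0.5)]
for p_h, c in zip((0.1, 0.3, 0.5), values):
    print(f"p(H) = {p_h}: C_P = {c:.3f}")

# Extremal cases corresponding to conditions III and V:
# E = H gives p(E|H) = 1 and p(E) = p(H), so C_P should equal 1 - p(H);
# evidence refuting H gives p(E|H) = 0, so C_P should equal -1.
verification = popper_corroboration(1.0, 0.3, 0.3)
refutation = popper_corroboration(0.0, 0.4, 0.2)
```

With p(E|H) and p(E) held fixed and E positively relevant to H, the computed values grow with p(H). In this example the less probable, more informative hypothesis therefore receives the lower degree of corroboration.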


The Impossibility Results 


Popper’s nine adequacy conditions are quite specific requirements and too 
strong for the purpose of a general analysis of degree of corroboration. 
We will therefore weaken them and retain only those adequacy conditions 
that are indispensable for a conceptual analysis of corroboration. We then 
proceed to showing two impossibility results for corroboration measures 
that (i) build on statistical relevance between H and E and the predictive 
success of H for E; and (ii) preserve the intuition that corroboration should 
be responsive to the informativity and testability of the tested hypothesis. 

First, we impose a condition which is mainly representational in nature 
and is frequently used in Bayesian Confirmation Theory and formal 
epistemology more generally (see Variations 2, 6 and 7 of this book for details). 
Popper’s own measure Cp(H,E) also conforms to it. 


Formality There exists a function $f: [0,1]^3 \cap \{(x,y,z) \mid 1 + xz - z \geq y \geq xz\} \to \mathbb{R}$ 
such that for all $E, H \in \mathcal{L}$ and $p \in \mathfrak{P}$, 


\[ C(H,E) = f(p(E|H), p(E), p(H)). \]


This condition relates degree of corroboration to the joint probability dis- 
tribution of E and H. The three arguments of f determine that distribution 
in all non-degenerate cases, and they are the same quantities that figure 
in Popper’s measure of corroboration Cp(H,E). This makes comparisons 
easier. Formality means that two scientists who agree about all relevant 
probabilities will make the same corroboration judgments.* 
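
The restriction on the domain of f can be made tangible with a small check. The sketch below uses the law of total probability, p(E) = p(E|H) p(H) + p(E|not-H) (1 - p(H)), to recover p(E|not-H) from the three arguments of f and tests whether it is a genuine probability. Function names and the numerical tolerance are our own illustrative choices.

```python
def implied_p_e_given_not_h(p_e_given_h: float, p_e: float, p_h: float) -> float:
    """Solve p(E) = p(E|H) p(H) + p(E|not-H) (1 - p(H)) for p(E|not-H)."""
    if p_h >= 1:
        raise ValueError("p(H) must be below 1 for p(E|not-H) to be defined")
    return (p_e - p_e_given_h * p_h) / (1 - p_h)

def is_coherent(p_e_given_h: float, p_e: float, p_h: float, tol: float = 1e-12) -> bool:
    """Check whether the three arguments of f can stem from one probability measure."""
    if not all(0 <= v <= 1 for v in (p_e_given_h, p_e, p_h)):
        return False
    if p_h >= 1:  # degenerate case: H is certain, so p(E) must equal p(E|H)
        return abs(p_e - p_e_given_h) <= tol
    q = implied_p_e_given_not_h(p_e_given_h, p_e, p_h)
    return -tol <= q <= 1 + tol

print(is_coherent(0.8, 0.4, 0.3))  # p(E|H) p(H) = 0.24 does not exceed p(E)
print(is_coherent(0.8, 0.4, 0.6))  # p(E|H) p(H) = 0.48 exceeds p(E) = 0.4
```

A triple is admissible exactly when the implied value of p(E|not-H) lies in the unit interval, which is equivalent to the two inequalities xz ≤ y ≤ 1 + xz - z for x = p(E|H), y = p(E), z = p(H).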

In a Popperian spirit, we now demand that corroboration track predic- 
tive success (e.g., Popper, 1983, 241-243): 


* Note that the corroboration measure is not defined on the entire unit cube $[0,1]^3$ since 
not all assignments of p(E|H), p(E) and p(H) are compatible with each other. This is 
evident from the equality 


\[ p(E) = p(E|H)\,p(H) + p(E|\neg H)\,(1 - p(H)) \]

Weak Law of Likelihood (WLL) For mutually exclusive hypotheses 
$H_1, H_2 \in \mathcal{L}$, $E \in \mathcal{L}$ and $p \in \mathfrak{P}$, if 


\[ p(E|H_1) \geq p(E|H_2) \quad \text{and} \quad p(E|\neg H_1) \leq p(E|\neg H_2) \tag{9.4} \]

with one inequality being strict, then $C(H_1,E) > C(H_2,E)$. 


The WLL has been defended as capturing a “core message of Bayes’ 
Theorem” (Joyce, 2008): if $H_1$ predicts E better than $H_2$, and $\neg H_2$ predicts 
E better than $\neg H_1$, then E favors $H_1$ over $H_2$. Since WLL is phrased in 
terms of predictive performance, it is even more compelling for 
corroboration than for degree of confirmation. After all, $p(E|\pm H_1)$ and $p(E|\pm H_2)$ 
measure how well $H_1$ and $H_2$ have stood up to a test with outcome E. The 
version given here is in one sense stronger and in one sense weaker than 
Joyce’s original formulation: it is stronger because only one inequality has 
to be strict (see also Brössel, 2013, 395-396); it is weaker because the WLL 
has been restricted to mutually exclusive hypotheses, where our intuitions 
tend to be more reliable. 

The next condition deals with the role of irrelevant evidence in corrob- 
oration judgments: 


Screened-Off Evidence Let $E_1, E_2, H \in \mathcal{L}$ and $p \in \mathfrak{P}$. If $E_2$ is 
probabilistically independent of $E_1$, $H$, and $E_1 \wedge H$ and $p(E_2) > 0$, then 
$C(H,E_1) = C(H, E_1 \wedge E_2)$. 


Structurally identical versions of this condition prominently figure in 
explications of confirmation and explanatory power (e.g., Kemeny and 
Oppenheim, 1952; Schupbach and Sprenger, 2011). It is a weaker 
version of condition (9.2) which demands, translated to corroboration, that 
C(H,E) = C(H,E’) if and only if p(H|E) = p(H|E’). To see this, just 
choose $E := E_1$, $E' := E_1 \wedge E_2$ and note that under the independence 
conditions of Screened-Off Evidence, 


\[ p(H|E_1 \wedge E_2) = \frac{p(H \wedge E_1|E_2)}{p(E_1|E_2)} = p(H|E_1) \]


* (continued) which implies, by setting $p(E|\neg H)$ to its extremal values, the inequalities 


\[ p(E) \geq p(E|H)\,p(H), \qquad p(E) \leq p(E|H)\,p(H) + 1 - p(H). \]


Hence, anybody who accepts condition (9.2) for measures of 
corroboration also needs to endorse Screened-Off Evidence. However, 
Screened-Off Evidence is also very sensible on independent grounds: in an 
experiment where H has been tested and (relevant) evidence $E_1$ has been 
observed, completely irrelevant extra evidence ($E_2 \perp\!\!\!\perp E_1, H, E_1 \wedge H$) should 
not change the evaluation of the results. Imagine, for example, that a 
scientist tests the hypothesis that voices with high pitch are recognized more 
easily. As her university is interested in improving the planning of lab 
experiments, the scientist also collects data on when participants drop in, 
which days of the week are busy, which ones are quiet, etc. Plausibly, 
these data satisfy the independence conditions of Screened-Off Evidence. 
But equally plausibly, they do not influence the degree of corroboration of 
the hypothesis under investigation. 
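
The independence conditions can be illustrated with a toy model. The sketch below builds a joint distribution over H, E1 and E2 in which E2 is, by construction, independent of H, E1 and their conjunction, and then compares the posterior of H given E1 with the posterior given E1 and E2. All probability values are hypothetical.

```python
from itertools import product

# Hypothetical joint distribution over (H, E1); the numbers are illustrative only.
p_h_e1 = {(True, True): 0.24, (True, False): 0.06,
          (False, True): 0.21, (False, False): 0.49}
p_e2 = 0.3  # probability of the irrelevant evidence E2 (assumed value)

# E2 is independent of H, E1 and H & E1 because the joint distribution factorizes.
joint = {(h, e1, e2): p_h_e1[(h, e1)] * (p_e2 if e2 else 1 - p_e2)
         for h, e1, e2 in product((True, False), repeat=3)}

def prob(event) -> float:
    """Probability of the set of worlds (h, e1, e2) that satisfy `event`."""
    return sum(p for world, p in joint.items() if event(*world))

posterior_e1 = prob(lambda h, e1, e2: h and e1) / prob(lambda h, e1, e2: e1)
posterior_e1_e2 = (prob(lambda h, e1, e2: h and e1 and e2)
                   / prob(lambda h, e1, e2: e1 and e2))
print(posterior_e1, posterior_e1_e2)
```

Up to floating-point error the two posteriors coincide, so any measure that depends on the evidence only through the posterior of H assigns the same value to E1 and to the conjunction of E1 and E2.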

The next adequacy condition is motivated by the problem of irrele- 
vant conjunctions for confirmation measures (e.g., Hawthorne and Fitel- 
son, 2004). Assume that hypothesis H asserts the wave nature of light. 
Taken together with a body of auxiliary assumptions, H implies the phe- 
nomenon E: the interference pattern in Young’s double slit experiment. 
Such an observation apparently corroborates the wave nature of light. 

However, once we tack an utterly irrelevant proposition such as H’ 
= “the chicken came before the egg” onto the hypothesis, it seems that E 
corroborates $H \wedge H'$ (the conjunction of the wave theory of light and the 
chicken-egg hypothesis) not more than H, if at all. After all, H’ was in no 
way tested by the observations we made. It has no record of past perfor- 
mance to which we could appeal. This problem, familiar from Bayesian 
Confirmation Theory (see Variation 2), motivates the following constraint: 


Irrelevant Conjunctions Assume the following conditions on $H, H', E \in \mathcal{L}$ 
and $p \in \mathfrak{P}$ are satisfied: 

1) $H$ and $H'$ are consistent and $p(H \wedge H') < p(H)$; 
2) $p(E) \in (0,1)$; 
3) $H \models E$; 
4) $H'$ is irrelevant for E. 

Then it is always the case that $C(H \wedge H',E) \leq C(H,E)$. 


This requirement states that for any non-trivial hypothesis H’ that is 
consistent with H (condition 1) and irrelevant for E (condition 4), $H \wedge H'$ is 
corroborated no more than H whenever H non-trivially entails E 
(conditions 2 and 3). A similar requirement has been defended for measures of 
empirical justification (Atkinson, 2012, 50-51). Indeed, it would be strange 
if corroboration (or justification) could be increased for free by attaching 
irrelevant conjunctions. That would also make it nearly impossible to re- 
ply persuasively to Duhem’s problem, and to separate innocuous from 
blameworthy hypotheses. Degree of corroboration is supposed to guide 
our evaluation of hypotheses in the light of experimental results. But a 
measure which is invariant under logical conjunction of hypotheses (for 
deductively implied evidence) cannot fulfil this function. 
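
As an illustration, the sketch below evaluates Popper's measure for the double-slit example, once for H and once for H conjoined with an irrelevant H'. It assumes the form C_P(H,E) = (p(E|H) - p(E)) / (p(E|H) + p(E) - p(E|H) p(H)), and it assumes that H' is probabilistically independent of H and E; all numbers are hypothetical.

```python
def popper_corroboration(p_e_given_h: float, p_e: float, p_h: float) -> float:
    """Popper's measure in the form assumed in the lead-in above."""
    return (p_e_given_h - p_e) / (p_e_given_h + p_e - p_e_given_h * p_h)

# Hypothetical probabilities for the double-slit example (illustrative values only).
p_h = 0.2         # wave theory of light H
p_h_prime = 0.5   # irrelevant conjunct H', assumed independent of H and E
p_e = 0.5         # interference pattern E, entailed by H

p_conj = p_h * p_h_prime  # p(H & H') under the independence assumption
# Since H entails E, both H and H & H' give E probability one.
c_h = popper_corroboration(1.0, p_e, p_h)
c_conj = popper_corroboration(1.0, p_e, p_conj)
print(f"C_P(H, E) = {c_h:.3f}, C_P(H & H', E) = {c_conj:.3f}")
```

In this example the conjunction is corroborated less than H alone, in line with Irrelevant Conjunctions. The price is that the measure rewards the higher prior probability of H, which is the feature the informativity conditions below call into question.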

Interestingly, the preceding adequacy conditions can be derived from 
Popper’s original adequacy conditions (all proofs are given in the ap- 
pendix): 


Theorem 9.1 The following statements are true: 


• Popper’s condition VII implies Weak Law of Likelihood for the case of 
equiprobable hypotheses. 

• Popper’s condition VII implies Screened-Off Evidence. 

• Popper’s condition VIIIc implies Irrelevant Conjunctions. 


This shows that our adequacy conditions are motivated in the right 
way: they are either weaker versions of Popper’s criteria, or closely related 
to them. We can thus be confident that our formal analysis of corrobora- 
tion is on target and that our adequacy conditions do not track a different, 
incompatible concept. 

However, unlike confirmation, corroboration contains an element of 
severe testing: the hypothesis should run a risk of being falsified. High 
informativity and testability contribute to this goal. As Popper states, “in 
many cases, the more improbable [...] hypothesis is preferable” (Popper, 
1979, 18-19), and the purpose of a measure of degree of corroboration is 
“to show clearly in which cases this holds and in which it does not hold” 
(ibid.). This motivates the following desideratum: 


Weak Informativity Degree of corroboration C(H,E) does not generally 
increase with the probability of H. That is, there are $H, H', E \in \mathcal{L}$ and 
$p \in \mathfrak{P}$ such that 


(1) $p(E|H) = p(E|H') > p(E)$; 
(2) $1/2 > p(H) > p(H')$; 
(3) $C(H,E) \leq C(H',E)$. 


The intuition behind Weak Informativity can also be expressed as follows: 
corroboration does not, in the first place, assess the probability of a hy- 
pothesis; therefore C(H,E) should not always increase with the probabil- 
ity of H. To this, the following condition—Strong Informativity—adds that 
low probability/high logical content can in principle be corroboration- 
conducive. Note that the requirement 1/2 > p(H), p(H’) is purely techni- 
cal and philosophically innocuous. 


Strong Informativity The informativity/logical content of a proposition 
can increase degree of corroboration, ceteris paribus. That is, there 
are $H, H', E \in \mathcal{L}$ and $p \in \mathfrak{P}$ such that 


(1) $p(E|H) = p(E|H') > p(E)$; 
(2) $1/2 > p(H) > p(H')$; 
(3) $C(H,E) < C(H',E)$. 


To our mind, any account of corroboration that denies these properties has 
stripped itself of its distinctive features with respect to degree of 
confirmation. At the very least, the Popperian characterization of 
corroboration as capturing both predictive success and testability would have to be 
abandoned, and links with NHST would have to be loosened. The idea 
behind Strong/Weak Informativity has also recently been defended by 
Roberto Festa in his discussion of the “Reverse Matthew Effect”: success- 
ful predictions reflect more favorably on powerful general theories than 
on restricted or weakened versions of them (Festa, 2012, 95-100). Note 
that neither Strong nor Weak Informativity postulates that corroboration 
decreases with prior probability; they just deny the “Matthew Effect” that 
corroboration co-varies with prior probability (see also Roche, 2014). 

We will now demonstrate that the listed adequacy conditions are in- 
compatible with each other. First, as a consequence of Weak Law of Like- 
lihood, corroboration increases with the prior probability of a hypothesis. 
This clashes directly with Strong/Weak Informativity: 


Theorem 9.2 No measure of corroboration C(H,E) constructed according to 
Formality can satisfy Weak Law of Likelihood and Weak/Strong Informativity 
at the same time. 

Since Formality is a purely representational condition, this result 
means that Weak Law of Likelihood and Weak/Strong Informativity pull 
in different directions: the first condition emphasizes the predictive per- 
formance of the tested hypothesis, the second its logical strength. It is per- 
haps surprising that these two conditions are already incompatible, since 
it is a popular tenet of critical rationalism that informative hypotheses are 
also more valuable predictively. 
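
The tension can be observed in a concrete case. The sketch below takes Popper's own measure, assumed here in the form C_P(H,E) = (p(E|H) - p(E)) / (p(E|H) + p(E) - p(E|H) p(H)), and searches a coarse grid for a witness of Weak Informativity, reading clause (3) non-strictly as C(H,E) ≤ C(H',E). Grid resolution and function names are our own choices, and a grid search is of course no substitute for the proof.

```python
from itertools import product

def popper_corroboration(p_e_given_h: float, p_e: float, p_h: float) -> float:
    """Popper's measure in the form assumed in the lead-in above."""
    return (p_e_given_h - p_e) / (p_e_given_h + p_e - p_e_given_h * p_h)

def is_coherent(x: float, y: float, z: float) -> bool:
    """Domain restriction on (p(E|H), p(E), p(H)): x*z <= y <= 1 + x*z - z."""
    return x * z <= y <= 1 + x * z - z

grid = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
priors = [z for z in grid if z < 0.5]   # technical requirement 1/2 > p(H)

witnesses = []
checked = 0
for x, y in product(grid, repeat=2):
    if x <= y:                          # clause (1) requires p(E|H) > p(E)
        continue
    for z_high, z_low in product(priors, repeat=2):
        if z_high <= z_low:             # clause (2) requires p(H) > p(H')
            continue
        if not (is_coherent(x, y, z_high) and is_coherent(x, y, z_low)):
            continue
        checked += 1
        # Clause (3): the more probable hypothesis is not better corroborated.
        if popper_corroboration(x, y, z_high) <= popper_corroboration(x, y, z_low):
            witnesses.append((x, y, z_high, z_low))

print(f"checked {checked} admissible cases, found {len(witnesses)} witnesses")
```

On this grid the list of witnesses stays empty: whenever two hypotheses predict E equally well, the measure ranks the more probable one strictly higher. A fortiori, no witness exists for the strict inequality demanded by Strong Informativity.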

Second, Strong Informativity clashes with Irrelevant Conjunctions and 
Screened-Off Evidence: 


Theorem 9.3 No measure of corroboration C(H,E) constructed according to 
Formality can satisfy Screened-Off Evidence, Irrelevant Conjunctions and Strong 
Informativity at the same time. 

Thus, the intuition behind Strong Informativity cannot be satisfied if 
other plausible adequacy constraints on degree of corroboration are ac- 
cepted. In particular, if a measure of corroboration is insensitive to irrel- 
evant evidence and does not reward adding irrelevant conjunctions, then 
it cannot give any bonus to informative hypotheses. The less informative 
and testable a hypothesis is, the higher its degree of corroboration, ceteris 
paribus. 

Finally, the result of Theorem 9.3 can be extended to Weak Informa- 
tivity if we make the assumption that irrelevant conjunctions dilute the 
degree of corroboration, rather than not increasing it (proof omitted). See 
also the corresponding remark in the motivation of Irrelevant 
Conjunctions. 

Note that these results are meaningful even for those who are not inter- 
ested in the project of explicating Popperian corroboration (e.g., because 
they are radical subjective Bayesians). Some of the above adequacy con- 
ditions have been proposed for measures of confirmation or explanatory 
power as well; others could be potentially interesting in this context. For 
instance, Brössel (2013) has recently discussed the condition Continuity, 
which is similar to Strong/Weak Informativity: if the posterior probabil- 
ities of two hypotheses are almost indistinguishable from each other, we 
should prefer the hypothesis which was initially less probable. Hence, 
the above results are also meaningful in the framework of Bayesian Con- 
firmation Theory: they indicate the impossibility of statistical relevance 
measures that capture informativity and predictive success at the same 
time. 
All this does not yet imply that explicating degree of corroboration 
is a futile project. Rather, it reveals a fundamental and insoluble tension 
between the two main contributing factors of corroboration that Popper 
identifies: predictive success and testability/informativity. Weak Law of 
Likelihood, Screened-Off Evidence and Irrelevant Conjunctions all speak 
to the predictive success intuition, whereas Strong/Weak Informativity 
rewards informative and testable hypotheses. In other words, the 
pre-theoretic concept of corroboration is overloaded with desiderata that 
point in different directions. The point of Theorems 9.2 and 9.3 is to lay 
bare these tensions and to suggest ways out of the dilemma. Basically, we 
have four options: (i) to reject one of the (substantial) adequacy 
conditions; (ii) to split up degree of corroboration into different 
sub-concepts that preserve subsets of these intuitions; (iii) to conclude 
that the explication of degree of corroboration is hopeless and not worthy 
of further pursuit; and (iv) to reconcile the various desiderata in a 
different mathematical and conceptual framework. 


Option (i) would come down to giving up one of Weak Law of Likelihood, 
Screened-Off Evidence, Irrelevant Conjunctions or Strong/Weak 
Informativity. But each of these adequacy conditions for degree of 
corroboration has been carefully motivated in the preceding section. Such a 
step would therefore appear arbitrary and unsatisfactory. 


For example, one could propose to endorse a statistical relevance 
measure of degree of confirmation as a measure of corroboration, giving up 
the informativity intuition. This has the advantage of relating 
corroboration to a large body of statistical and philosophical literature 
on degree of confirmation, but it comes at the price of stripping 
corroboration of its defining characteristics, and it runs into the 
objections presented in Section 9.1. 


Also, statistical relevance measures generally depend on $p(E|\neg H)$, 
either explicitly or via the calculation of $p(E)$ and $p(H|E)$. This 
creates a variety of problems. Consider, for example, a Binomial model 
where we test the null hypothesis $H_0 : \theta = 0.5$ against the 
alternative $H_1 : \theta \neq 0.5$. If the observed relative frequency of 
successes is close to 0.5, for example $\bar{x} = 0.53$, the degree of 
corroboration of the null hypothesis should not depend on the likelihoods 
$p(\bar{x}|\theta)$ for very large and very small values of $\theta$. Such 
alternatives are logically possible, but apparently irrelevant for testing 
the adequacy of the point null hypothesis $\theta = 0.5$. However, for 
statistical relevance measures in the spirit of Bayesian Confirmation 
Theory, this dependence is inevitable since 
$p(x|\theta \neq \theta_0) = \int_0^1 p(x|\theta)\, p(\theta)\, d\theta$. The 
probability of the data under the alternative is just the weighted average 
of all the likelihoods. 
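
The averaging can be made concrete with a short numerical sketch. It is an illustration under assumptions that the passage above does not fix: N = 100 trials with 53 successes, a uniform prior density over $\theta$, and a simple midpoint rule for the integral. All function names are hypothetical.

```python
from math import comb


def binomial_likelihood(successes, trials, theta):
    """p(x | theta) for x successes in a fixed number of i.i.d. trials."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


def marginal_likelihood(successes, trials, lower=0.0, upper=1.0, steps=20_000):
    """Midpoint-rule approximation of the integral of p(x | theta) * p(theta)
    over [lower, upper], ASSUMING a uniform prior density p(theta) = 1 on [0, 1]."""
    width = (upper - lower) / steps
    return width * sum(
        binomial_likelihood(successes, trials, lower + (i + 0.5) * width)
        for i in range(steps)
    )


x, n = 53, 100  # illustrative data, chosen to match a relative frequency of 0.53
null_likelihood = binomial_likelihood(x, n, 0.5)
alternative_likelihood = marginal_likelihood(x, n)

# Share of the averaged likelihood contributed by parameter values far from 0.5.
extreme_share = (
    marginal_likelihood(x, n, 0.0, 0.3) + marginal_likelihood(x, n, 0.7, 1.0)
) / alternative_likelihood

print(f"p(x | theta = 0.5)  = {null_likelihood:.4f}")
print(f"p(x | theta != 0.5) = {alternative_likelihood:.4f}")
print(f"share contributed by theta outside [0.3, 0.7]: {extreme_share:.2e}")
```

Under these assumptions, the parameter values outside [0.3, 0.7] contribute almost nothing to the integral although they carry 60% of the prior weight. This is the sense in which extreme alternatives pull down the averaged likelihood of the alternative.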

Option (ii) amounts to endorsing pluralism for degree of corroboration. 
The model case for this option is the probabilistic analysis of degree of 
confirmation: some measures, like $d(H,E) = p(H|E) - p(H)$, capture the 
boost in degree of belief in H provided by E, while others, like 
$l(H,E) = p(E|H)/p(E|\neg H)$, aim at the discriminatory power of E with 
respect to H and $\neg H$. However, it is not clear what similarly 
interesting subconcepts could look like for degree of corroboration. Right 
now, this option does not appear to be viable. 
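
The difference between the two confirmation measures can be seen in a small sketch. The function names and the numerical values are illustrative assumptions: the likelihoods are held fixed at $p(E|H) = 0.8$ and $p(E|\neg H) = 0.2$ while the prior of H varies.

```python
def difference_measure(prior_h, p_e_given_h, p_e_given_not_h):
    """d(H, E) = p(H|E) - p(H), with the posterior computed via Bayes' theorem."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    posterior_h = p_e_given_h * prior_h / p_e
    return posterior_h - prior_h


def likelihood_ratio_measure(p_e_given_h, p_e_given_not_h):
    """l(H, E) = p(E|H) / p(E|not-H)."""
    return p_e_given_h / p_e_given_not_h


# Hypothetical likelihoods; only the prior of H changes between rows.
for prior in (0.1, 0.5, 0.9):
    d = difference_measure(prior, 0.8, 0.2)
    l = likelihood_ratio_measure(0.8, 0.2)
    print(f"p(H) = {prior:.1f}:  d = {d:.3f},  l = {l:.1f}")
```

The likelihood ratio stays the same across priors, whereas the difference measure shrinks when H is already very probable. The two measures therefore track different aspects of evidential support.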

Neither does the pessimistic option (iii) have much appeal, unless con- 
vincing reasons are given why scientists can dispense with the concept of 
corroboration, and hypothesis testing in general. 

This leaves us with option (iv): to change the mathematical framework 
for explicating degree of corroboration. Perhaps it is neither necessary nor 
sufficient to base a corroboration judgment on the joint probability 
distribution of H and E? As noted above, statistical relevance measures of 
corroboration compare the merits of H with the merits of $\neg H$, defined 
as the aggregate of alternatives to H. However, a comparison to such an 
aggregate does not make much sense in many NHST contexts where we deal 
with a multitude of distinct alternatives $H_i$, $i \in \mathbb{N}$. Perhaps 
corroboration judgments should be made with respect to the best-performing 
alternative in the hypothesis space, and not with respect to all possible 
alternatives. This is the option that we explore in the next section. 


Toward a New Explication of Corroboration 


Statistical relevance measures of corroboration compare the merits of H 
with the merits of $\neg H$, defined as the aggregate of alternatives to H. 
For example, in the above example from the Binomial model, with 
$H : \theta = 0.5$ and $E : \bar{x} = 0.53$, $p(E|\neg H)$ would be equal to 
the value of 
$p(x|\theta \neq \theta_0) = \int_0^1 p(\theta)\, p(x|\theta)\, d\theta$. This 
is a property that the above statistical relevance measures of 
corroboration have in common with Bayesian confirmation measures, such as 
the Bayes factor. 

However, a comparison to such an aggregate does not make much 
sense in many NHST contexts where we deal with a multitude of distinct 
alternatives $H_i$, $i \in \mathbb{N}$. It seems to be essential that we 
have many alternatives in such a testing problem from which we can choose, 
and not just a “one-size-fits-all” probabilistic mixture of them. Perhaps 
degree of corroboration should be measured by comparing $H_0$ to the best 
available alternative, rather than to the collective of alternatives, which 
inevitably contains some very implausible hypotheses. 

In the remainder of this section, we sketch an explication of degree 
of corroboration in a framework with many distinct alternatives to the 
tested hypothesis $H_0$. As a consequence, Formality has to be dropped 
and degree of corroboration becomes partition-relative: testing $H_0$ with 
alternative $\neg H_0$ can lead to different corroboration judgments than 
testing $H_0$ with alternatives $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$, 
even if $\neg H_0 = \bigvee_{1 \le i \le n} H_i$ 
(cf. Good, 1960, 1968b,a, 1975). Consider, for example, a test whether a 
medical drug is effective. The null corresponds to a particular parameter 
value $H_0 : \theta = \theta_0$, indicating efficacy at placebo level, and 
the alternative to $H_1 : \theta \neq \theta_0$. Depending on the practical 
implications of certain effect sizes, we may divide the hypotheses into the 
following coarse-grained intervals: “worse than a placebo”, “as good as a 
placebo”, “slightly better than a placebo”, “clearly better than a 
placebo”, etc. In other testing contexts (e.g., determining the value of a 
natural constant), by contrast, a very fine-grained partition of the 
alternatives would seem more appropriate. 
We now derive such a measure of corroboration on axiomatic grounds. 
In the explication, we focus exclusively on measuring past performance 
and neglect the testability intuition. It will resurface later, though. The 
first and most substantial requirement states that corroboration judgments 
are made with respect to the best-performing alternative in the hypothesis 
space, and not with respect to all possible alternatives. 


CA1 Corroboration is the minimal weight of evidence in favor of $H_0$ 
when compared to all relevant alternatives, up to rescaling. That is, 
the degree of corroboration that E provides for $H_0$ relative to 
$\mathcal{H}$ can be defined as 

\[ C_{\mathcal{H}}(H_0, E) = \min_{H_i \in \mathcal{H}} W(H_0, H_i, E) \tag{9.5} \]

where $W(H_0, H_i, E)$ quantifies the weight of evidence that E provides 
for $H_0$ and against the specific alternative $H_i$. 


The idea is that positive corroboration requires that no genuine 
alternative $H_i \in \mathcal{H}$ be evidentially favored over $H_0$. On the 
weight of evidence function W, we impose the following constraints: 


CA2 There exists a real-valued, continuous function 
$g : [0,1]^2 \to \mathbb{R}$ such that 
$W(H_0, H_1, E) := g(p(E|H_0), p(E|H_1))$. In other words, weight of 
evidence only depends on the probability of E under $H_0$ and $H_1$. 


The idea is that weight of evidence is a function of the predictive perfor- 
mance of both hypotheses, in line with Popper’s characterization of cor- 
roboration as indicating past performance. Similar requirements are made 
in Good (1952); Bernardo (1999) and Williamson (2010). Finally, we make 
a convenience-based constraint on g: its range is normalized to [-1,1] and 
it should be represented as a rational function, in the mathematical sense 
of the word: 


CA3 $g(x,y)$ is the simplest function of the form 

\[ g(x,y) = \frac{\sum_{i=0}^{m} \sum_{j=0}^{m} c_{ij}\, x^i y^j}{\sum_{i=0}^{n} \sum_{j=0}^{n} d_{ij}\, x^i y^j} \tag{9.6} \]

with the properties 

\[ g(1,0) = 1, \qquad g(0,1) = -1. \]


From these constraints, we can derive 


Theorem 9.4 CA1-CA3 jointly determine the unique weight of evidence 
function 

\[ W(H_0, H_i, E) = \frac{p(E|H_0) - p(E|H_i)}{p(E|H_0) + p(E|H_i)} \tag{9.7} \]

and the corroboration measure 

\[ C_{\mathcal{H}}(H_0, E) = \min_{H_i \in \mathcal{H}} \frac{p(E|H_0) - p(E|H_i)}{p(E|H_0) + p(E|H_i)} \tag{9.8} \]

In other words, we obtain the Kemeny-Oppenheim measure of (contrastive) 
confirmation, familiar from Variation 2, as a measure of weight of 
evidence. It is ordinally equivalent to the likelihood ratio, that is, the 
Bayes factor. This may be seen as the Bayesian foundation in the 
explication of corroboration. The corroboration measure itself is then 
equal to the degree of confirmation that $H_0$ obtains when pitted against 
the best-performing alternative. Positive degree of corroboration entails 
that there is no superior alternative. 
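
A minimal sketch of the two definitions may be helpful. The function names and the example likelihoods are hypothetical; only the formulas (9.7) and (9.8) are taken from the theorem.

```python
def weight_of_evidence(p_e_h0, p_e_hi):
    """Kemeny-Oppenheim weight of evidence, as in equation (9.7)."""
    total = p_e_h0 + p_e_hi
    if total == 0:
        raise ValueError("E has probability zero under both hypotheses")
    return (p_e_h0 - p_e_hi) / total


def corroboration(p_e_h0, alternative_likelihoods):
    """Equation (9.8): the minimal weight of evidence over all alternatives,
    that is, the weight against the best-performing alternative."""
    return min(weight_of_evidence(p_e_h0, p) for p in alternative_likelihoods)


# Ordinal equivalence with the likelihood ratio r = p(E|H0) / p(E|Hi):
# W = (r - 1) / (r + 1), which is strictly increasing in r.
p0, p_alt = 0.3, 0.1  # illustrative values
r = p0 / p_alt
assert abs(weight_of_evidence(p0, p_alt) - (r - 1) / (r + 1)) < 1e-12

# The score is driven by the alternative with the highest likelihood (0.6 here).
print(corroboration(0.4, [0.1, 0.2, 0.6]))
```

Taking the minimum of W is the same as comparing $H_0$ with the alternative that has the highest likelihood, since W decreases as the likelihood of the alternative increases.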
We shall now apply this measure to our example of statistical inference 
in a Binomial model. A series of independent and identically distributed 
(i.i.d.) coin tosses is performed. In the figures below, degree of 
corroboration of the null hypothesis $H_0 : \theta = 0.5$ is plotted as a 
function of the number of observed successes (e.g., “heads”; x-axis), for 
the sample size N = 100 and for three different partitions of the 
alternative hypotheses. The green dots correspond to classical Bayesian 
hypothesis testing with only a single alternative (= the probabilistic 
mixture of the $H_i$). The orange dots report the results for the 
best-performing alternative in the set of intervals [0,0.1), [0.1,0.2), 
etc. The blue dots, finally, are derived from the maximally fine-grained 
partition, that is, the best-performing point alternative in the interval 
[0,1]. The left figure uses a uniform weighting of point values within the 
intervals that represent alternative hypotheses. The right figure uses a 
slightly centered weighting ($\beta(2,2)$). As visible from the plots, 
there are no significant qualitative differences between the weightings, so 
we will disregard them from now onwards. 
Figure 9.2: Degree of corroboration of the hypothesis 
$H_0 : \theta = \theta_0$ plotted against number of observed successes, for 
sample size N = 100. The green dots correspond to the alternative 
$\mathcal{H} = \{[0,1]\}$. The orange dots correspond to 
$\mathcal{H} = \{[0,0.1), [0.1,0.2), \ldots\}$. The blue dots correspond to 
$\mathcal{H} = [0,1]$. Left figure: weighting $\beta(1,1)$; right figure: 
weighting $\beta(2,2)$. 


We see that for the coarse-grained partition (green dots), degree of 
corroboration is positive until x = 60: the performance of the alternative 
is dragged down by the extreme alternatives close to zero and one (their 
score is mixed with the well-fitting hypotheses). This is also the result 
yielded by the Bayes factor. Degree of corroboration diminishes if the 
alternatives are more fine-grained (e.g., for alternatives of the type 
[0,0.1), [0.1,0.2), etc.). Here, the break-even point is x = 55. For the 
maximally fine-grained alternative (= every point hypothesis is a potential 
alternative), degree of corroboration is always negative. This is actually 
very natural: when each parameter value is a serious scientific option, how 
can a point null hypothesis ever be corroborated unless the sample mean 
agrees exactly with the hypothesized parameter value? 
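
The comparison between partitions can be sketched numerically. The code below is an illustration under stated assumptions, not a reproduction of the figure: it uses uniform weighting within each interval, a midpoint rule for the averaged likelihoods, and hypothetical names (`N`, `THETA_0`, `corroboration`, etc.).

```python
from math import comb

N = 100        # sample size, as in the example above
THETA_0 = 0.5  # tested point null hypothesis


def likelihood(x, theta):
    """Binomial probability of x successes in N trials."""
    return comb(N, x) * theta**x * (1 - theta) ** (N - x)


def interval_likelihood(x, lower, upper, steps=1_000):
    """Average likelihood over [lower, upper], ASSUMING uniform weighting
    of the point values inside the interval (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(likelihood(x, lower + (i + 0.5) * width) for i in range(steps)) / steps


def weight_of_evidence(l0, l_alt):
    return (l0 - l_alt) / (l0 + l_alt)


def corroboration(x, partition):
    """Weight of evidence against the best-performing interval alternative."""
    best_alternative = max(interval_likelihood(x, a, b) for a, b in partition)
    return weight_of_evidence(likelihood(x, THETA_0), best_alternative)


def corroboration_point(x):
    """Maximally fine-grained case: the best point alternative is the
    maximum-likelihood value x / N (a supremum over theta != 0.5)."""
    return weight_of_evidence(likelihood(x, THETA_0), likelihood(x, x / N))


coarse = [(0.0, 1.0)]                                # single mixed alternative
tenths = [(k / 10, (k + 1) / 10) for k in range(10)]  # ten interval alternatives

for x in (50, 53, 55, 60, 65):
    print(x,
          round(corroboration(x, coarse), 3),
          round(corroboration(x, tenths), 3),
          round(corroboration_point(x), 3))
```

With uniform weighting, the likelihood of the single mixed alternative equals 1/(N + 1), so the coarse-grained score changes sign between x = 60 and x = 61, in line with the description above. The break-even point for the interval partition depends on the weighting and on how the intervals are drawn, so this sketch need not match the figure exactly.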

These findings suggest that more fine-grained partitions lead to a 
smaller degree of corroboration, ceteris paribus. Indeed, we can verify 
this claim: 


Theorem 9.5 If $\mathcal{H}$ is a subpartition of $\mathcal{H}'$, then 
$C_{\mathcal{H}}(H_0,E) \le C_{\mathcal{H}'}(H_0,E)$, provided that the 
alternative hypotheses are weighted in the same way. 

This property shows that the testability of the alternatives affects the 
degree of corroboration of the null hypothesis. If the alternatives are very 
specific and testable, the degree of corroboration of the null hypothesis is 
lower than if the alternatives are quite unspecific. Hence, Popper’s two 
crucial aspects of corroboration—past performance and testability—have 
finally been reconciled, although not with respect to the null hypothesis 
itself, but with respect to the alternative hypothesis. 


Discussion 


After studying the formal properties of our corroboration measure, we 
now proceed to a philosophical evaluation. First, we list several essential 
features of $C_{\mathcal{H}}$ that distinguish it from p-values in NHST and 
Bayesian confirmation measures. 


1. The explication of corroboration is sensitive to the partition of 
hypotheses against which the null is tested. This is the key conceptual 
move in this section. The Bayesian framework conceptualizes the 
alternative hypothesis as the probabilistic mixture of all point values 
different from $\theta = \theta_0$. However, we argue that it is often 
fruitful to think about the alternative as a set of hypotheses, e.g., 
intervals that correspond to a certain scientific conclusion (e.g., 
small/sizeable/very large effect). 


2. The explication is entirely independent of raising one’s confidence 
in the tested hypothesis and therefore distinctive of corroboration as 
opposed to confirmation (→ Objection 1). The measure of corroboration 
is constructed as a comparison of the null hypothesis with the best 
possible alternative. This is in agreement with scientific reasoning, 
where we accept a theory only if it outperforms the alternatives. 


3. Hypotheses with prior probability zero can be corroborated as well 
(→ Objection 2). That we are not prepared to bet on the truth of 
a precise point hypothesis, regardless of the betting odds, does not 
preclude that this hypothesis can perform well and be corroborated 
with respect to certain competitors. 


4. The explication is asymmetric, respecting that the roles of the null 
hypothesis and the alternative are not interchangeable (→ Objection 
3). This preserves an important feature of NHST without buying 
into its methodological flaws. 


Notably, subjective elements are still present in corroboration-based 
hypothesis testing. First and foremost, they enter in the partitioning of 
alternatives. To what extent is the null hypothesis a good idealization of 
reality, and what are the error margins that we are willing to accept? What 
is a scientifically meaningful effect size, and which differences can be 
neglected? Second, they enter in the weighting of point hypotheses within 
the alternatives. This may be negligible for very fine-grained hypotheses, 
but it may substantially affect the outcome for fairly coarse-grained 
partitions. Third, the Bayes factor emerges as the degree of corroboration 
for a maximally coarse-grained partition 
($\mathcal{H} = \{\neg H_0\}$). In other words, Bayesian hypothesis testing 
can be represented as a special case of evaluating hypothesis tests in 
terms of degrees of corroboration. 

This brings us to questions for future research. An obvious project is 
the aforementioned reconciliation of Bayesian inference and NHST within 
a corroboration-centered perspective, and in particular, examining the 
hypothesis that our explication of corroboration unifies Bayesian and 
non-Bayesian hypothesis testing. Second, the proposed corroboration measure 
needs to be applied to more complicated cases of statistical inference, 
including nuisance parameters, hierarchical models and model selection. In 
this context, it would also be challenging to see what kind of meaning the 
notorious p-value obtains within a corroboration-based framework. Third, 
one may conduct case studies that reconstruct specific episodes of 
scientific reasoning as guided by corroboration judgments, and see whether 
these episodes fit into a probabilistic explication of degree of 
corroboration. 
The next variation stays with the topic of statistical inference and fo- 
cuses on the problem of model selection: comparing classes of statistical 
hypotheses that are parametrized by one or several variables. Similar to 
the question of whether testability is conducive to degree of corrobora- 
tion, investigated in this variation, we will ask the question of whether 
simplicity should be a cognitive value in model selection. 
Proofs of the Theorems 


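Proof of Theorem 9.1: We begin by showing that condition VII implies 
the Weak Law of Likelihood (WLL). Assume $p(H_1) = p(H_2)$. We distinguish 
two jointly exhaustive cases in which WLL may apply: 

Case 1: $p(E|H_1) > p(E|H_2)$ 
Case 2: $p(E|H_1) = p(E|H_2)$ and $p(E|\neg H_1) < p(E|\neg H_2)$. 

For the first case, the proof is simple in virtue of the inequality 

\[ p(H_1|E) = p(H_1)\, \frac{p(E|H_1)}{p(E)} > p(H_2)\, \frac{p(E|H_2)}{p(E)} = p(H_2|E). \]

Then, VII guarantees that $c(H_1,E) > c(H_2,E)$. 

For the second case, let $x := p(E|H_1) = p(E|H_2)$ and 
$y := p(H_1) = p(H_2)$. Since WLL concerns mutually exclusive hypotheses, we 
know that 

\begin{align*}
p(E|\neg H_1) &= \frac{1}{1-y} \left[ p(E|H_2)\, p(H_2) + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right] \\
&= \frac{1}{1-y} \left( xy + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right), \\
p(E|\neg H_2) &= \frac{1}{1-y} \left[ p(E|H_1)\, p(H_1) + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right] \\
&= \frac{1}{1-y} \left( xy + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right).
\end{align*}

Hence, $p(E|\neg H_1) = p(E|\neg H_2)$. On the other hand, we have assumed 
that $p(E|\neg H_1) < p(E|\neg H_2)$. This shows that the second case can 
never occur and may be dismissed. 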


We now prove the second implication, that is, VII $\Rightarrow$ Screened-Off 
Evidence. To this end, remember that condition VII reads 


VII If $p(H) = p(H')$, then $c(H,E) \le c(H',E')$ if and only if 
$p(H|E) \le p(H'|E')$. 


Assuming $H = H'$, it is easy to see that VII implies 


VII' If $p(H|E) = p(H|E')$, then $c(H,E) = c(H,E')$. 



The reason is simple: If $p(H|E) = p(H|E')$, then also 
$p(H|E) \le p(H|E')$ and the ‘≤’ direction of VII implies 
$c(H,E) \le c(H,E')$, where H has been substituted for H'. Now we repeat the 
same trick with the premise $p(H|E') \le p(H|E)$ and we obtain 
$c(H,E') \le c(H,E)$. Taking both inequalities together yields the 
conclusion $c(H,E) = c(H,E')$ and thereby VII'. 

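Notice that under the conditions of Screened-Off Evidence, 
$p(H|E_1 \wedge E_2) = p(H|E_1)$. This is so because 

\begin{align*}
p(H|E_1 \wedge E_2) &= p(H)\, \frac{p(E_1 \wedge E_2|H)}{p(E_1 \wedge E_2)} \\
&= p(H)\, \frac{p(E_2)\, p(E_1|H)}{p(E_2)\, p(E_1)} = p(H|E_1).
\end{align*}

Hence, we can apply VII' to the case of Screened-Off Evidence, with 
$E := E_1$ and $E' := E_1 \wedge E_2$. This implies 

\[ c(H, E_1 \wedge E_2) = c(H, E_1), \]

completing the proof. 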
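
The key step, that a screened-off piece of evidence leaves the posterior unchanged, can be checked on a toy model. The sketch assumes that $E_2$ is fully independent of both H and $E_1$, which is one way to satisfy the conditions of Screened-Off Evidence. The probability values and function names are hypothetical.

```python
from itertools import product


def joint_distribution(p_h, p_e1_given_h, p_e1_given_not_h, p_e2):
    """Joint distribution over (H, E1, E2) in which E2 is independent of H and E1."""
    dist = {}
    for h, e1, e2 in product((True, False), repeat=3):
        p_of_h = p_h if h else 1 - p_h
        p_e1 = p_e1_given_h if h else p_e1_given_not_h
        p_of_e1 = p_e1 if e1 else 1 - p_e1
        p_of_e2 = p_e2 if e2 else 1 - p_e2
        dist[(h, e1, e2)] = p_of_h * p_of_e1 * p_of_e2
    return dist


def probability(dist, event):
    """Total probability of all worlds (h, e1, e2) in which `event` holds."""
    return sum(p for world, p in dist.items() if event(*world))


def conditional(dist, event, given):
    """p(event | given), computed from the joint distribution."""
    both = probability(dist, lambda h, e1, e2: event(h, e1, e2) and given(h, e1, e2))
    return both / probability(dist, given)


# Illustrative numbers only.
dist = joint_distribution(p_h=0.3, p_e1_given_h=0.9, p_e1_given_not_h=0.4, p_e2=0.25)
posterior_e1 = conditional(dist, lambda h, e1, e2: h, lambda h, e1, e2: e1)
posterior_both = conditional(dist, lambda h, e1, e2: h, lambda h, e1, e2: e1 and e2)
print(posterior_e1, posterior_both)  # the two posteriors coincide up to rounding
```

Since the two posteriors agree, VII' applies and any measure satisfying VII must assign the same degree of corroboration to H under $E_1$ and under $E_1 \wedge E_2$.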


Finally, we have the implication VIIIc $\Rightarrow$ Irrelevant 
Conjunctions. Let the conditions of Irrelevant Conjunctions ([1] to [4]) be 
satisfied for $H, H', E \in \mathcal{L}$ and $p \in \mathfrak{P}$. Since 
$H \models E$, VIIIc implies that $c(H,E)$ and $c(H \wedge H',E)$ are 
increasing functions of the probability of the tested hypothesis, $p(H)$ 
and $p(H \wedge H')$, respectively. But by assumption, we have 
$p(H \wedge H') < p(H)$. Hence, it follows that 
$c(H \wedge H',E) \le c(H,E)$. 


Proof of Theorem 9.2: By Weak Informativity and Formality, there are 
$x > y$ and $z > z'$ with $z + z' < 1$, $1 + xz - z \ge y \ge xz$ and 
$1 + xz' - z' \ge y \ge xz'$ such that 

\[ f(x,y,z) \le f(x,y,z'). \]

Choose a probability function p such that $p(H_1) = z$, $p(H_2) = z'$, 
$p(H_1 \wedge H_2) = 0$, $p(E|H_1) = p(E|H_2) = x$, $p(E) = y$. We now verify 
that this distribution satisfies the axioms of probability. Because of 
$xz > xz'$ and $1 + xz - z < 1 + xz' - z'$, it suffices to verify the 
inequalities $y \ge xz$ and $1 + xz - z \ge y$. 
First note that 

\[ p(E) = p(E|H_1)\, p(H_1) + p(E|H_2)\, p(H_2) + p(E|\neg H_1, \neg H_2)\, (1 - p(H_1) - p(H_2)) \]

which translates, setting $w := p(E|\neg H_1, \neg H_2)$, as 

\[ y = xz + xz' + w(1 - z - z'). \]

It follows that 

\begin{align*}
y - xz &= xz + xz' + w(1 - z - z') - xz \\
&= xz' + w(1 - z - z') \\
&\ge 0, \\
1 + xz - z - y &= 1 + xz - z - xz - xz' - w(1 - z - z') \\
&= (1 - x)z' + (1 - w)(1 - z - z') \\
&\ge 0.
\end{align*}

In both cases, all summands are greater than or equal to zero because 
$z + z' < 1$ by assumption. This completes the proof that the above 
probability distribution is well-defined. 

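Now it is straightforward to show that 

\begin{align*}
p(E|\neg H_1) &= \frac{1}{1 - p(H_1)} \left[ p(E|H_2)\, p(H_2) + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right] \\
&= \frac{1}{1 - p(H_1)} \left[ p(E|H_1)\, p(H_2) + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right], \\
p(E|\neg H_2) &= \frac{1}{1 - p(H_2)} \left[ p(E|H_1)\, p(H_1) + p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2) \right],
\end{align*}

because by assumption, $p(E|H_1) = p(E|H_2)$. From this we can infer 

\begin{align*}
p(E|\neg H_1) - p(E|\neg H_2)
&= \frac{p(E|H_1)\, p(H_2)}{1 - p(H_1)} + \frac{p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2)}{1 - p(H_1)} \\
&\quad - \frac{p(E|H_1)\, p(H_1)}{1 - p(H_2)} - \frac{p(E|\neg H_1, \neg H_2)\, p(\neg H_1, \neg H_2)}{1 - p(H_2)} \\
&= p(E|H_1) \left( \frac{p(H_2)}{1 - p(H_1)} - \frac{p(H_1)}{1 - p(H_2)} \right) \\
&\quad + p(E|\neg H_1, \neg H_2)\, (1 - p(H_1) - p(H_2)) \left( \frac{1}{1 - p(H_1)} - \frac{1}{1 - p(H_2)} \right) \\
&= \frac{(p(H_1) - p(H_2))\, (p(H_1) + p(H_2) - 1)}{(1 - p(H_1))\, (1 - p(H_2))}\, \left( p(E|H_1) - p(E|\neg H_1, \neg H_2) \right).
\end{align*}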
If we look at the signs of the involved factors, we notice first that 
$p(H_1) - p(H_2) = z - z' > 0$ and $p(H_1) + p(H_2) - 1 = z + z' - 1 < 0$. 
Then we observe that $H_1$ and $H_2$ were disjoint and that $p(E|H_1)$ and 
$p(E|H_2)$ are both greater than $p(E)$, implying 
$p(E|H_1) = p(E|H_2) > p(E|\neg H_1, \neg H_2)$. Taken together, we can then 
conclude 

\[ p(E|\neg H_1) - p(E|\neg H_2) < 0. \]


Hence, the conditions for applying Weak Law of Likelihood are satisfied: 
$H_1$ and $H_2$ are two mutually exclusive hypotheses with 
$p(E|H_1) = p(E|H_2)$ and $p(E|\neg H_1) < p(E|\neg H_2)$. Thus we can 
conclude 

\[ f(x,y,z) = c(H_1,E) > c(H_2,E) = f(x,y,z'), \]

in contradiction with the inequality $f(x,y,z) \le f(x,y,z')$ that we got 
from Weak Informativity. 
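
The algebraic identity used in this proof can be checked numerically. The following sketch uses hypothetical function names and illustrative values satisfying $z > z'$, $z + z' < 1$ and $x > w$, where $w = p(E|\neg H_1, \neg H_2)$.

```python
def p_e_given_not(h_prior, other_prior, x, w):
    """p(E | not-H) when H and the other hypothesis are mutually exclusive,
    both have likelihood x, and w = p(E | neither hypothesis holds)."""
    rest = 1 - h_prior - other_prior
    return (x * other_prior + w * rest) / (1 - h_prior)


def closed_form_difference(z, z_prime, x, w):
    """Closed form for p(E|not-H1) - p(E|not-H2) derived in the proof."""
    return (z - z_prime) * (z + z_prime - 1) / ((1 - z) * (1 - z_prime)) * (x - w)


z, z_prime, x, w = 0.4, 0.2, 0.7, 0.3  # illustrative values only
direct = p_e_given_not(z, z_prime, x, w) - p_e_given_not(z_prime, z, x, w)
print(direct, closed_form_difference(z, z_prime, x, w))
```

Both expressions agree and are negative for these values, as the sign analysis in the proof requires. With equal priors, the closed form vanishes, which corresponds to the second case in the proof of Theorem 9.1.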


Lemma 1 Any measure of corroboration 
$c : \mathcal{L}^2 \times \mathfrak{P} \to \mathbb{R}$ that satisfies 
Screened-Off Evidence and Formality also satisfies the equality 

\[ f(\alpha x, \alpha y, z) = f(x, y, z) \tag{9.9} \]

for $x > y > 0$, $z > 0$ and $0 < \alpha < 1$ with 
$1 + xz - z \ge y \ge xz$. 
Proof of Lemma 1: For any $0 < \alpha < 1$, $x > y > 0$ and $z > 0$ with 
$1 + xz - z \ge y \ge xz$, we can choose sentences 
$H, E_1, E_2 \in \mathcal{L}$ and a probability function 
$p \in \mathfrak{P}$ such that 

\begin{align*}
\alpha &:= p(E_2) & p(E_2 \wedge H) &= p(E_2)\, p(H) \\
x &:= p(E_1|H) & p(E_1 \wedge E_2) &= p(E_2)\, p(E_1) \\
y &:= p(E_1) & p(E_1 \wedge E_2|H) &= p(E_2)\, p(E_1|H) \\
z &:= p(H).
\end{align*}

Since our choice of p is not restricted, this is always possible. Now, 
the conditions of Screened-Off Evidence are satisfied, and it follows that 
$c(H, E_1 \wedge E_2) = c(H, E_1)$. By Formality, we can also derive the 
equalities 

\begin{align*}
c(H, E_1 \wedge E_2) &= f(p(E_1 \wedge E_2|H),\, p(E_1 \wedge E_2),\, p(H)) \\
&= f(p(E_2)\, p(E_1|H),\, p(E_2)\, p(E_1),\, p(H)) \\
&= f(\alpha x, \alpha y, z), \\
c(H, E_1) &= f(x, y, z).
\end{align*}

Taking all these equalities together delivers the desired result: 

\[ f(\alpha x, \alpha y, z) = c(H, E_1 \wedge E_2) = c(H, E_1) = f(x, y, z). \]

Finally we note that $(\alpha x, \alpha y, z)$ is always in the domain of f 
when $\alpha < 1$ and $1 + xz - z \ge y \ge xz$: 

\begin{align*}
\alpha y &\ge (\alpha x) z, & \alpha y &\le \alpha (1 + xz - z) \\
& & &= (\alpha x) z + \alpha (1 - z) \\
& & &\le 1 + (\alpha x) z - z.
\end{align*}

Proof of Theorem 9.3: Choose sentences $H_1, H_2, E \in \mathcal{L}$ and a 
probability function $p \in \mathfrak{P}$ such that the conditions of Strong 
Informativity are satisfied: 

(1) $p(E|H_1) = p(E|H_2) > p(E)$; 

(2) $1/2 \ge p(H_1) > p(H_2)$; 

(3) $c(H_1,E) < c(H_2,E)$. 

Writing $x := p(E|H_1) = p(E|H_2)$, $y := p(E)$, $z := p(H_1)$ and 
$z' := p(H_2)$, we then obtain 

\[ f(x,y,z) = c(H_1,E) < c(H_2,E) = f(x,y,z'). \tag{9.10} \]

Since $c(H,E)$ satisfies Formality and Screened-Off Evidence, by Lemma 
1 it also satisfies the equality 

\[ f(\alpha x, \alpha y, z) = f(x, y, z) \]

for $x > y > 0$, $z > 0$ and $0 < \alpha < 1$. It is easy to see that 
$(1, y/x, z)$ is in the domain of f if $(x,y,z)$ is. Applying the above 
equality to $f(1, y/x, z)$ and choosing $\alpha := x$, we now obtain 

\[ f(1, y/x, z) = f(x,y,z), \qquad f(1, y/x, z') = f(x,y,z'). \]

Then it follows from inequality (9.10) and the above equalities that 

\[ f(1, y/x, z) < f(1, y/x, z') \tag{9.11} \]

for these specific values of x, y, z and z'. 

We can now find sentences H, H', E' and a probability function $p'(\cdot)$ 
such that the conditions of Irrelevant Conjunctions are satisfied and at 
the same time, $p'(H) = z$, $p'(H \wedge H') = z'$, $p'(E') = y/x$. This 
implies $c(H \wedge H', E') \le c(H, E')$. By Formality, this also implies 

\[ f(1, y/x, z) \ge f(1, y/x, z'). \]

However, this inequality contradicts Equation (9.11) that we have shown 
before. Hence, the theorem is proven. 


Proof of Theorem 9.4: We have to show that $W(H_0, H_1, E)$ is indeed of the 
form (9.7). From CA2, we know that it must be a function of $p(E|H_0)$ and 
$p(E|H_1)$. Now we use CA3 to prove its specific form. 

Assume first that m = 0 and n = 1. In that case, the neutrality 
condition $f_S(H_0, H_1, E) = 0$ if $p(E|H_0) = p(E|H_1)$ cannot be satisfied 
unless $c_{00} = 0$ because the numerator is a constant. Hence, we can 
neglect this possibility. 

Now assume that m = 1 and n = 0. Here, the neutrality condition 
$f_S(H_0, H_1, E) = 0$ if $p(E|H_0) = p(E|H_1)$ leads to the equation 

\[ c_{00} + (c_{10} + c_{01})\, p(E|H_0) + c_{11}\, p(E|H_0)^2 = 0 \tag{9.12} \]

which is satisfied in general if and only if $c_{00} = c_{11} = 0$ and 
$c_{10} = -c_{01}$. Clearly, the resulting function 
$f(H_0, H_1, E) = p(E|H_0) - p(E|H_1)$ is not ordinally equivalent to 
$S(H_0,E) - S(H_1,E) = \log p(E|H_0) - \log p(E|H_1)$, regardless of the 
value of $c_{10}$ and the base of the logarithm. Hence, we can neglect this 
possibility, too. 

Now assume that m = n = 1. Again, the neutrality condition leads to 
the conclusion $c_{00} = c_{11} = 0$ and $c_{10} = -c_{01}$. Now, let us set 
$p(E|H_0) = 1$, $p(E|H_1) = 0$, and vice versa. Then, the maximality 
constraint implies $d_{10} = d_{01} = 1$ and the simplest function that 
maintains ordinal equivalence with $S(H_0,E) - S(H_1,E)$, as demanded by 
CA1', is obtained by setting $d_{00} = d_{11} = 0$. 

From this result, the theorem follows by a simple application of CA1. 


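Proof of Theorem 9.5: Without loss of generality, we can restrict 
ourselves to the case of two disjoint alternatives 
$\mathcal{H} = \{H_1 \vee H_2\}$, $\mathcal{H}' = \{H_1, H_2\}$, with 
$p(E|H_1) \le p(E|H_2)$. This is because the relative weighting of the 
elements of $H_1$ and $H_2$ stays the same. Now we observe 

\begin{align*}
p(E|H_1 \vee H_2) &= \frac{1}{p(H_1 \vee H_2)} \left( p(E|H_1)\, p(H_1) + p(E|H_2)\, p(H_2) \right) \\
&\le \frac{1}{p(H_1) + p(H_2)} \left( p(E|H_2)\, p(H_1) + p(E|H_2)\, p(H_2) \right) \\
&= p(E|H_2).
\end{align*}

The mixed alternative thus never outperforms the best-performing element of 
the finer partition, and the weight of evidence decreases in the likelihood 
of the alternative. Hence 
$C_{\mathcal{H}}(H_0, E) \ge C_{\mathcal{H}'}(H_0, E)$, irrespective of the 
value of $p(E|H_0)$. 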
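
The inequality can also be probed with a randomized check. This is a sketch with hypothetical names and arbitrary sampling ranges. It assumes two mutually exclusive alternatives whose mixture is weighted by their prior probabilities, and it supports the proof without replacing it.

```python
import random


def weight_of_evidence(l0, l_alt):
    return (l0 - l_alt) / (l0 + l_alt)


def corroboration(l0, alternative_likelihoods):
    return min(weight_of_evidence(l0, l) for l in alternative_likelihoods)


def mixture_likelihood(likelihoods, priors):
    """p(E | H1 v H2 v ...) for mutually exclusive hypotheses."""
    return sum(l * p for l, p in zip(likelihoods, priors)) / sum(priors)


rng = random.Random(0)  # fixed seed so that the check is repeatable
for _ in range(1_000):
    l0, l1, l2 = (rng.uniform(0.01, 1.0) for _ in range(3))
    p1, p2 = rng.uniform(0.01, 0.5), rng.uniform(0.01, 0.5)
    coarse = corroboration(l0, [mixture_likelihood([l1, l2], [p1, p2])])
    fine = corroboration(l0, [l1, l2])
    # The coarse partition never yields a lower score than the fine one.
    assert coarse >= fine - 1e-12
```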


Variation 10: Simplicity and Model 
Selection 


“Numquam ponenda est pluralitas sine necessitate.” 
(William of Occam) 


Is simplicity a cognitive value? Is it indicative of a good scientific 
theory? Few questions in philosophy of science are older, and few have 
been debated more controversially. The thesis that simple demonstrations, 
scientific theories or ontological systems are more valuable than complex 
ones has already been defended by great philosophical minds such as 
Aristotle, Aquinas and Kant, e.g.: 


If a thing can be done adequately by means of one, it is su- 
perfluous to do it by means of several; for we observe that 
nature does not employ two instruments where one suffices. 
(Aquinas, 1945, 129) 


Indeed, the belief that simplicity is a cognitive value is often backed by 
an ontological assumption that among different ways nature could be, the 
simple one is more likely to be true. In a weaker version of that thesis, 
simplicity is not necessarily seen as truth-conducive, but as contributing 
to the success, verisimilitude, predictive accuracy or rational acceptability 
of a theory. Thomas S. Kuhn (1977a) also includes simplicity in the list of 
standard criteria for scientific theory choice. 

Opponents reply that this belief is unjustified: we have no reason to 
assume that nature is simple rather than complex. On that view, simplicity 
is nothing more than a pragmatic value related to our cognitive limitations 
as human beings (e.g., van Fraassen, 1980). Simple theories are easier to 
handle than complex ones, be it for purposes of prediction, explanation, or 
further theoretical development. Thus, to what extent is simplicity related 
to the success of science? 
To answer this question from a Bayesian perspective, we have to 
distinguish between different dimensions of simplicity in scientific 
inference. The first distinction concerns the syntactic and the semantic 
dimension of simplicity. The semantic dimension is concerned with the 
ontological implications of a theory. How many entities does the theory 
postulate? Are they all of the same kind? And so on. This dimension of 
simplicity is called parsimony (Nolan, 1997; Baker, 2003, 2010). Again, the 
thesis that parsimonious theories are to be preferred to less parsimonious 
theories has two 
aspects, one pertaining to the number of entities postulated by the the- 
ory in question, and one pertaining to plurality in the kinds of postulated 
entities. Both of them are fundamental questions in the metaphysics of 
science. We feel that there is little that the Bayesian framework—which is 
primarily a tool for uncertain reasoning—can contribute to deciding these 
questions, and leave them aside in what follows. 

The syntactic dimension of simplicity is more interesting for our pur- 
poses. It deals with the way scientific theories are formulated: How many 
hypotheses are postulated? How complex are they? Can they be related 
to each other in a straightforward way? Discussions about the role of 
simplicity in curve-fitting and other everyday elements of scientific 
practice belong to this systematic dimension, which the survey article by 
Baker (2010) calls elegance. In the remainder of the variation, whenever we write 
“simplicity”, we refer to the syntactic dimension of simplicity as elegance. 
We focus on Bayesian accounts of simplicity, that is, on the question of 
whether it is rational to prefer simpler theories to more complex ones 
for purely epistemic reasons. These accounts will also be contrasted with 
non-Bayesian explications of simplicity in model selection. 


Simplicity in Model Selection 


The debate about simplicity as elegance has, in recent decades, focused 
on the role of simplicity in model selection. This involves the comparison 
of different statistical models on the basis of their fit with a dataset and 
the intrinsic properties of these models. A nice feature of model selection 
is that simplicity can be quantified neatly, in terms of the number of free 
parameters that a statistical model posits. This understanding of simplicity 
is also manifest in various model selection criteria and allows for a quite 
rigorous treatment of the role of simplicity. Here and in the sequel, the 
term model selection is used as shorthand for the more general term of 
statistical model comparison or model evaluation. That is, model selection 
need not refer to a choice of the modeler, or a decision to reject/accept a 
particular model. It can also lead to a judgment about which model is, all 
things taken together, superior to its competitors. 

In the last 20 years, there has been a boom of papers on the value 
of simplicity in statistical inference, prompted by Forster and Sober’s 
1994 paper in BJPS. In that paper, the authors challenge van Fraassen’s 
view that simpler models are preferred solely on non-empirical, pragmatic 
grounds, such as mathematical convenience. They substantiate their thesis 
with the help of the Akaike Information Criterion (AIC), a statistical model 
selection criterion that involves a tradeoff of simplicity and goodness-of- 
fit. Along these lines, they argue that the simplicity of a model contributes 
to its predictive success. 

In Forster and Sober’s view, the real epistemological question sur- 
rounding simplicity is not whether simple models are more likely to be 
true, but whether they are predictively accurate (Forster, 2002; Sober, 
2002). After all, in modeling economic growth, climate change, social 
decision-making, etc., statistical models are just idealizations of an ex- 
cessively complex reality. Simple models, however, may capture salient 
aspects of reality, whereas complicated models only muddle the waters by 
introducing effects that do not really matter. According to this argument, 
simplicity is a genuine cognitive value because it contributes to attaining 
another cognitive value, namely predictive accuracy, whose centrality for 
the scientific enterprise stands undisputed (Kuhn, 1977a; McMullin, 1982, 
2008; Douglas, 2013). 

We can distinguish two questions regarding the relation between sim- 
plicity and predictive accuracy: 


1. The qualitative question: Do simple models tend to be more predictively
accurate, ceteris paribus, thereby vindicating simplicity as a
genuine cognitive value?

2. The quantitative question: What is the weight of simplicity vis-à-vis
other cognitive values (e.g., goodness-of-fit) in model selection?


In this variation, we shall argue for an affirmative answer to the first
question. Forster and Sober, however, go beyond this: they argue that the
epistemic pull of simplicity is established by means of the mathematical
properties of a particular model comparison criterion, Akaike’s information
criterion AIC. By contrast, we argue that a tradeoff between simplicity
and goodness-of-fit is bound to be highly context-dependent and cannot
be captured by a single criterion.

The rest of the variation is structured into three major sections: one 
dealing with the qualitative rationale for preferring simpler models in 
model selection (Section 10.2), and two dealing with the quantitative trade- 
off rate between simplicity and goodness-of-fit. The first of these sections 
reviews a non-Bayesian model selection criterion (Section 10.3), the second 
a Bayesian model selection criterion (Section 10.4). We also investigate 
whether there is a specific Bayesian way to make simplicity matter, that is, 
a model selection criterion where a direct link between simplicity and pos- 
terior probability can be established. We observe that Bayesian inference 
often plays an instrumental role in model selection, rather than providing 
a genuine philosophical underpinning of a particular procedure. Finally, 
we summarize our findings (Section 10.5) and provide additional mathe- 
matical details (Section 10.6). 


Curve Fitting and Estimation Error 


Statistical model analysis compares a large set of candidate models to 
given data, in the hope of selecting the best model on which predictions,
explanations and further inferences can be based. Often, it is unrealistic to 
assume that the “true model” (i.e., the data-generating process) is found 
among the candidate models: data sets are often huge and messy, the 
underlying processes are complex and hard to describe theoretically, and 
contain lots of noise and confounding factors. Furthermore, the candidate 
models are often generated by automatic means (e.g., as linear combina- 
tions of potential predictor variables). This means that they usually do 
not provide the most striking mechanism, or the best scientific explana- 
tion of the data. Rather, they are constructed from the data, supposed 
to explain them reasonably well and to be a reliable device for future 
predictions. This “bottom-up” approach (Burnham and Anderson, 2002; 
Sober, 2002) to curve-fitting and model-building is complementary to a 
“top-down” approach where complex models are derived from general 
theoretical principles (Weisberg, 2007).

How does this work in a concrete case? Consider fitting a scatterplot
such as Figure 10.1 with polynomial curves. Assume that we describe the
relationship between an independent (“input”) variable $x$ and a dependent
(“output”) variable $y$ either by a linear or by a quadratic polynomial,
together with i.i.d. noise terms $\varepsilon_i$ (e.g., $\varepsilon_i \sim N(0,1)$)
and coefficients $\alpha$, $\beta$, and $\gamma$. Then, the data points
$D = \{(x_i, y_i),\ 1 \le i \le N\}$ are described by the equations


$$ \text{(LIN)} \quad y_i = \alpha + \beta x_i + \varepsilon_i \tag{10.1} $$
$$ \text{(PAR)} \quad y_i = \alpha + \beta x_i + \gamma x_i^2 + \varepsilon_i \tag{10.2} $$


The linear model (LIN) now corresponds to the null hypothesis $H_0: \gamma = 0$
and the quadratic model (PAR) to the more general alternative $H_1: \gamma \neq 0$.
The standard method for fitting the linear model to the data is the method
of ordinary least squares (OLS): the parameters $\hat\alpha$ and $\hat\beta$ are chosen such
that the thus-defined curve $y = \hat\alpha + \hat\beta x$ makes the data $D$ most
likely among all pairs $(\alpha, \beta)$. If the error terms are i.i.d. and follow a
Gaussian (=Normal) distribution, this is equivalent to minimizing the sum
of the squared residuals, $\sum_i (y_i - \alpha - \beta x_i)^2$, that is, the variation in the
data that cannot be explained by assuming the model (LIN).

For simple linear regression, the values $(\hat\alpha, \hat\beta)$ that minimize the above
sum can be calculated analytically: $\hat\beta = \mathrm{Cov}(X,Y)/\mathrm{Var}(X)$, $\hat\alpha = \bar{y} - \hat\beta \bar{x}$.
For more complex regression models, numerical methods are normally
used. But the idea stays the same: to use the maximum likelihood estimate
(MLE), that is, the parameter values that account best for the data.
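
The fitting procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the data are simulated from (LIN) with made-up coefficients, and the helper name fit_polynomial is hypothetical. The sketch fits (LIN) and (PAR) by least squares and checks the closed-form solution for simple linear regression.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit; returns coefficients (lowest order first) and the
    residual sum of squares."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss

# Simulated data (assumption): generated by (LIN) with alpha = 1, beta = 0.5
# and standard normal noise.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 30)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)

coef_lin, rss_lin = fit_polynomial(x, y, 1)  # model (LIN)
coef_par, rss_par = fit_polynomial(x, y, 2)  # model (PAR)

# Closed-form solution for simple linear regression
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()

print("LIN coefficients:", coef_lin, "RSS:", rss_lin)
print("PAR coefficients:", coef_par, "RSS:", rss_par)
print("closed form:", alpha_hat, beta_hat)
```

Because (LIN) is nested in (PAR), the quadratic fit can never have a larger residual sum of squares than the linear one, whatever the data-generating process.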

Figure 10.1 shows how linear and quadratic curves are fitted to the 
data. Notably, the values of y increase in both directions in the quadratic 
model, but not in the linear model. For higher order polynomials, the 
effect would even be greater; the curve would be highly oscillatory and 
there would be many additional extremal and inflection points for which 
we would not have a scientific explanation. 

Figure 10.1: A linear model (LIN, green line) and a quadratic model (PAR,
orange line) are fitted to a scatterplot of data according to the ordinary
least squares method. (Panels: Linear Form, Quadratic Form.)

In general, complex models can achieve a satisfactory fit for more data
sets than simple models can. These superior fitting resources can also
be a vice, as the problem of overfitting illustrates: the more degrees of
freedom a model has, the more difficult it is to simultaneously estimate
all model parameters, and the more likely we are to fit the noise in the
data. This is sometimes expressed by saying that complex models have high
estimation variance. Simultaneously estimating numerous parameters is more
difficult and error-prone than only estimating one or two of them. Notably,
complex models can perform worse than simpler models even if they are
closer to the data-generating process.
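
A small simulation may make the point about estimation variance concrete. The setup is entirely our own assumption: a weakly quadratic data-generating process, 15 design points, standard normal noise, and a hypothetical helper train_and_test_error. For each polynomial degree, the sketch compares the average squared error on the training data with the average squared error on fresh data drawn from the same process.

```python
import numpy as np

def train_and_test_error(degree, n=15, trials=4000, gamma=0.3, seed=1):
    """Average in-sample and out-of-sample squared error of a polynomial
    least-squares fit, over many simulated data sets."""
    rng = np.random.default_rng(seed)  # same seed: same data for every degree
    x = np.linspace(-1.0, 1.0, n)
    truth = 1.0 + 2.0 * x + gamma * x**2           # assumed true curve
    X = np.vander(x, degree + 1, increasing=True)
    y_train = truth[:, None] + rng.normal(size=(n, trials))
    y_test = truth[:, None] + rng.normal(size=(n, trials))
    coef, *_ = np.linalg.lstsq(X, y_train, rcond=None)  # one fit per column
    fitted = X @ coef
    train = float(np.mean((y_train - fitted) ** 2))
    test = float(np.mean((y_test - fitted) ** 2))
    return train, test

results = {d: train_and_test_error(d) for d in (1, 2, 6)}
for d, (train, test) in results.items():
    print(f"degree {d}: training error {train:.3f}, test error {test:.3f}")
```

In this setup, the degree-6 polynomial fits the training data best but predicts fresh data worse than the straight line, although the straight line is, strictly speaking, the wrong model.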

This problem is aggravated by the frequent use of MLEs: maximum
likelihood estimation is usually overoptimistic with respect to the
predictive performance of the chosen model, especially when the number of
adjustable parameters is high. An MLE always selects the best-fitting model
in a model class. Projecting the current goodness-of-fit to future
predictive accuracy just neglects the problem of high estimation variance and
the danger of overfitting. While there is a strong epistemic intuition that
we should try to find the correct model, even if it is very complex, a
predictive perspective may recommend preferring a wrong but stable simple model:
the joint estimation of the coefficients in the complex model will almost
always lead to misleading parameter values, and thus to bad predictions.

Therefore it appears natural to assign an epistemic value to simplicity
in curve fitting and model selection. Null hypothesis significance testing
takes this into account. When a null hypothesis $H_0: \theta = \theta_0$ is tested
against the alternative $H_1: \theta \neq \theta_0$, the alternative is more complex than
the null since it has one additional degree of freedom: the parameter $\theta$
takes no definite value under $H_1$. This allows $H_1$ to fit the data better
than $H_0$. On the other hand, if the population mean is not exactly equal
to $\theta_0$, but quite close to it, the null hypothesis still does a good job. In
general, the null hypothesis makes more precise (though not necessarily
more accurate) predictions than the alternative, and it is easier to use in
theoretical developments. Therefore, the statistical thresholds for rejecting
the null in favor of the alternative are typically high (e.g., the famous
$p < 0.05$). By imposing high standards before calling observed data significant
evidence against $H_0$, we compensate for the fact that complex hypotheses
find it easier to achieve a good fit, even if they are mistaken.


There is a striking resemblance, by the way, between this viewpoint and 
Popper’s idea that good scientific theories should trade off simplicity— 
which is associated with being informative—and predictive accuracy: 
“Science does not aim, primarily, at high probabilities. It aims at a high 
informative content, well backed by experience.” (Popper, 2002, 416, orig- 
inal emphasis) Such a viewpoint can, in turn, be related to truthlikeness 
or verisimilitude as a primary goal of science, and it is possible to find a 
fruitful role for simplicity in that paradigm (e.g., Oddie, 1986; Niiniluoto, 
1999). 


One note of caution, though. The understanding of simplicity as number
of free parameters works well for polynomial models and similar cases,
but not across the board. Some models are deceptively simple. For example,
with only two free parameters ($\alpha$ and $\beta$) we can construct a functional
dependency $f(x) = \alpha \sin(\beta x)$ such that all data points in a dataset $D$ are
fitted up to an arbitrary amount of precision, notwithstanding the size of
$D$. But certainly, this model has many features that are hard to make
sense of scientifically, such as the extreme oscillation of $f$ as a function of
$x$. Comparing different hypotheses in terms of simplicity is thus relative to
a particular model family (e.g., polynomials) in which they are embedded.


The problem of estimation variance establishes that simplicity can be
a cognitive value in statistical inference: it helps us to make more
reliable estimates. But can we also determine an optimal tradeoff rate
between simplicity and goodness-of-fit? This thesis has been defended with
respect to Akaike’s model comparison criterion AIC, a criterion that is
widely applied in scientific reasoning and has attracted particular interest
in ecological modeling (Burnham and Anderson, 2002, 2004).


The Akaike Information Criterion (AIC)


This section takes a closer look at the Akaike Information Criterion (AIC) 
proposed by Akaike (1973) and Sakamoto et al. (1986) as a representative 
of non-Bayesian ways to make simplicity matter for predictive accuracy. 
Partly, we chose AIC because of its historical role in model selection and 
partly, because it has been used by philosophers of science as a showcase 
for making general epistemic claims about the role of simplicity in sci- 
entific inference (Forster and Sober, 1994; Forster, 1999, 2000; Forster and 
Sober, 2010; Sober, 2002, 2008). 

The AIC tries to estimate the discrepancy between the candidate model
and the unknown true model. A popular metric for this discrepancy is the
Kullback-Leibler divergence


$$ D_{KL}(f \,\|\, g_\theta) = \int f(x) \log \frac{f(x)}{g_\theta(x)} \, dx = \int f(x) \log f(x) \, dx - \int f(x) \log g_\theta(x) \, dx \tag{10.3} $$


known from Variation 1. Here, $f$ denotes the probability density of the
unknown true model, $g_\theta$ is a class of candidate models indexed by the
parameter $\theta$, and the integral is taken over the sample space (=the set of
observable results). As stated before, Kullback-Leibler divergence is used
in information theory to measure the loss of information when estimating the
unknown distribution $f$ by an approximating distribution $g_\theta$.
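
To get a feeling for Equation (10.3), one may evaluate it for a case where both densities are known. The following sketch assumes, purely for illustration, that the "true" model and the candidate are both Normal distributions; it compares a numerical evaluation of the integral with the well-known closed-form expression for the KL-divergence between two Gaussians. The function names are our own.

```python
import numpy as np

def normal_log_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_numeric(mu_f, sigma_f, mu_g, sigma_g, lo=-20.0, hi=20.0, n=200_001):
    """Riemann-sum approximation of the integral in (10.3)."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    log_f = normal_log_pdf(x, mu_f, sigma_f)
    log_g = normal_log_pdf(x, mu_g, sigma_g)
    return float(np.sum(np.exp(log_f) * (log_f - log_g)) * dx)

def kl_closed_form(mu_f, sigma_f, mu_g, sigma_g):
    """Closed-form KL-divergence between two univariate Normal densities."""
    return (np.log(sigma_g / sigma_f)
            + (sigma_f**2 + (mu_f - mu_g) ** 2) / (2.0 * sigma_g**2)
            - 0.5)

print(kl_numeric(0.0, 1.0, 1.0, 2.0), kl_closed_form(0.0, 1.0, 1.0, 2.0))
print(kl_closed_form(1.0, 2.0, 0.0, 1.0))  # roles of f and g swapped
```

The last line illustrates that KL-divergence is not symmetric: the discrepancy of $g$ from $f$ generally differs from the discrepancy of $f$ from $g$.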

Of course, we cannot compute KL-divergence directly for a given candidate
model $g_\theta$. First, we do not know the true probability density $f$.
This implies that we can only estimate KL-divergence. Second, $g_\theta$ is not a
single model, but stands for an entire class of models with parameter $\theta$. We
have to use a particular element of $g_\theta$ for the estimation procedure. The
maximum likelihood estimator $g_{\hat\theta}$ is a particularly natural candidate: it is
the model whose parameter values maximize the likelihood of the data,
given the model. However, if one used the maximum likelihood estimator
to estimate KL-divergence without any corrective terms, one would
overestimate the closeness to the true model. Third, we are not interested in
KL-divergence per se, but in predictive success. So we should relate (10.3)
in some way to the predictive performance of a model. Akaike’s (1973)
famous mathematical result addresses these worries:


Akaike’s Theorem: For observed data $y$ and a candidate model
class $g_\theta$ with $K$ adjustable parameters (or an adjustable parameter of
dimension $K$), the model comparison criterion


$$ \mathrm{AIC}(g_\theta, N) := -\log g_{\hat\theta(y)}(y) + K \tag{10.4} $$


is an asymptotically unbiased estimator of
$\mathbb{E}_x \mathbb{E}_y \left[ \log \left( f(x) / g_{\hat\theta(y)}(x) \right) \right]$, the “expected predictive success
of $g_{\hat\theta}$”.

In the above equation, $g_{\hat\theta(y)}$ denotes the probability density of the
maximum likelihood estimate $\hat\theta(y)$. To better understand the double
expectation in the last term, note that the maximum likelihood estimate $g_{\hat\theta}$ is
determined with the help of the data set $y$. Then, $g_{\hat\theta}$’s KL-divergence
to the true model $f$ is evaluated with respect to another set of data $x$.
This justifies the name predictive success, and taking the expectation two
times (over training data $y$ and test data $x$) justifies the name expected
predictive success.
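
As an illustration, Equation (10.4) can be evaluated for the polynomial models (LIN) and (PAR) from Section 10.2. The sketch below rests on assumptions of our own: the noise is Normal with known unit variance (as in the example $\varepsilon_i \sim N(0,1)$), so that $K$ counts only the polynomial coefficients, and the data are simulated. The helper name is hypothetical.

```python
import numpy as np

def aic_known_unit_variance(x, y, degree):
    """AIC in the form of Equation (10.4): minus the maximized log-likelihood
    plus the number of adjustable parameters, assuming N(0, 1) noise."""
    X = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLE under Gaussian noise
    rss = float(np.sum((y - X @ coef) ** 2))
    n = x.size
    max_log_lik = -0.5 * n * np.log(2.0 * np.pi) - 0.5 * rss
    k = degree + 1                                  # number of coefficients
    return -max_log_lik + k

# Simulated data (assumption): generated by the linear model (LIN)
rng = np.random.default_rng(3)
x = np.linspace(-2.0, 2.0, 40)
y = 1.0 + 0.5 * x + rng.normal(size=x.size)

aic_lin = aic_known_unit_variance(x, y, 1)
aic_par = aic_known_unit_variance(x, y, 2)
print("AIC(LIN) =", aic_lin, " AIC(PAR) =", aic_par)
```

Under these assumptions, the quadratic model is preferred only if the additional parameter reduces the residual sum of squares by more than 2, that is, if it improves the log-likelihood by more than one unit.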

In other words, AIC estimates expected predictive success by subtracting
the number of parameters $K$ from the log-likelihood of the data under
the maximum likelihood estimate $g_{\hat\theta}$. It gives an asymptotically unbiased
estimate of predictive success, that is, an estimate that will, in the long
run, center around the true value. The more parameters a model has, the more
we have to correct the maximum likelihood estimate in order to obtain an
unbiased estimate. We are then to favor the model which minimizes AIC among
all candidate models. According to Forster and Sober,


Akaike’s theorem shows the relevance of goodness-of-fit and 
simplicity to our estimate of what is true. But of equal im- 
portance, it states a precise rate-of-exchange between these 
two conflicting considerations: it shows how the one quan- 
tity should be traded off against the other. (Forster and Sober, 
1994, 11) 


Moreover, they use Akaike’s theorem to counter the (empiricist) idea that 
simplicity is a merely pragmatic virtue and that “hypothesis evaluation 
should be driven by data, not by a priori assumptions about what a ‘good’ 
hypothesis should look like [such as being simple, the author]” (Forster 
and Sober, 1994, 27). By means of Akaike’s theorem, simplicity is assigned 
a specific weight in model selection and established as a cognitive value.

However, this argument does not stand on firm ground. First,
unbiasedness is not sufficient to ensure the goodness of an estimator. The
goodness of an estimator $\hat\theta$ relative to the true value $\theta$ is usually measured
by the mean square error, which can be written as the square of the bias
plus the variance of the estimator:


$$ \mathrm{MSE}[\hat\theta] = \left( \mathbb{E}[\hat\theta - \theta] \right)^2 + \mathbb{E}\left[ (\hat\theta - \mathbb{E}[\hat\theta])^2 \right] $$


If $\hat\theta$ is unbiased, the first term will disappear, but this does not ensure low
overall error: an unbiased estimator may have high variance, scatter far
from the true value and perform very badly in practice. In particular, it may
be outperformed by an estimator with low variance that is only slightly
biased. By itself, unbiasedness does not warrant good performance.
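
A standard textbook case shows how a biased estimator can beat an unbiased one. The following sketch is our own illustration and not part of the AIC debate itself: for Normal samples, it compares the unbiased variance estimator (division by $n-1$) with a deliberately biased one (division by $n+1$) and decomposes the mean square error of each into squared bias plus variance. Sample size, number of trials and the helper name are arbitrary choices.

```python
import numpy as np

def mse_decomposition(estimates, true_value):
    """Return (bias, variance, mean square error) of simulated estimates."""
    bias = float(np.mean(estimates) - true_value)
    variance = float(np.var(estimates))
    mse = float(np.mean((estimates - true_value) ** 2))
    return bias, variance, mse

rng = np.random.default_rng(42)
n, trials, true_var = 10, 200_000, 1.0
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
# Sum of squared deviations from the sample mean, one value per trial
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

unbiased = mse_decomposition(ss / (n - 1), true_var)  # unbiased estimator
shrunk = mse_decomposition(ss / (n + 1), true_var)    # biased, lower variance

for name, (bias, variance, mse) in (("unbiased", unbiased), ("shrunk", shrunk)):
    print(f"{name}: bias {bias:.4f}, variance {variance:.4f}, MSE {mse:.4f}")
```

In this simulation the biased estimator has the smaller mean square error, because its reduction in variance outweighs the squared bias it introduces.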

This objection may be countered by noting that unbiasedness is an ad- 
vantage, ceteris paribus. Forster and Sober (2010) note that AIC and one of 
its rivals, the Bayesian Information Criterion BIC, just differ by a constant. 
If estimators differ by a constant, they have the same variance. Since mean 
square error = square of bias + variance, the unbiased estimator will have 
the lower mean square error. Hence BIC seems to be a worse estimator of 
predictive accuracy than the unbiased AIC. 

However, this argument is based on an oversight which many authors
in the debate commit (Forster and Sober, 1994, 2010; Kieseppä, 1997). AIC
is not an unbiased estimator; it is just asymptotically unbiased. In other
words, the property of unbiasedness is only realized for very large
samples. In defense of these authors, it should be added that the standard
AIC textbook by Sakamoto et al. (1986, 69) sometimes uses this formulation
in passing. Several other passages (Sakamoto et al., 1986, 65, 77, 81)
make clear, however, that the word “unbiased”, when applied to AIC, is
merely used as a shortcut for “asymptotically unbiased”.

To see this with your own eyes, we invite you to have a look at the
mathematical details in Section 10.6. There, the dependence of Akaike’s
Theorem on the asymptotic, and not the actual, normality of the maximum
likelihood estimator becomes clear. This has substantial consequences,
and speaks against a normative interpretation of Akaike’s findings.
AIC outperforms BIC as an estimator of predictive accuracy only
for an infinitely large sample, whereas actual applications usually deal
with medium-size finite samples. As long as we don’t know the speed of
convergence (and this varies from data set to data set), appeals to the
asymptotic properties are unwarranted. This is good news for those who want
to defend Bayesian model selection against its non-Bayesian competitors such
as AIC.

Finally, the contribution of simplicity relative to goodness-of-fit in
Akaike’s Theorem diminishes as sample size $N$ increases, as Forster (2002)
notes himself (see Section 10.6 for details). The goodness-of-fit term is of
the order of $N$ whereas the simplicity-based contribution remains constant.
Thus, with increasing sample size, simplicity drops out of the
picture. In the light of these observations, claims that AIC establishes a
tradeoff rate between simplicity and goodness-of-fit (thereby establishing
simplicity as an epistemic value) appear in a dim light.


The Bayesian Information Criterion (BIC) 


Are there model selection criteria that are firmly anchored within, and
derived from, Bayesian reasoning? The classical, subjective view of Bayesian
inference consists in reasoning from a prior to a posterior probability. A
model selection procedure is called Bayesian if it is based on the posterior
distribution of degrees of belief, or on the difference between prior
and posterior probabilities. This was also the rationale behind the manifold
measures of confirmation presented in Variation 2.

An example of such a procedure is model selection based on Bayes
factors. They compare the performance of the rival models $H_0$ and $H_1$
by means of the ratio


$$ B_{01}(E) := \frac{p(H_0|E)}{p(H_1|E)} \cdot \frac{p(H_1)}{p(H_0)} = \frac{p(E|H_0)}{p(E|H_1)} $$


Citing work by Spiegelhalter and Smith (1980), Kass and Raftery (1995,
790) argue that Bayes factors act as a “fully automatic Occam’s razor” for
nested models: when the Bayes factor favors a simple model, the complex
model will be penalized for hosting lots of poor-fitting hypotheses. In that
case, the loss in predictive accuracy by accepting the simpler model will be
negligible. The search for an explicit tradeoff rate between simplicity and
goodness-of-fit is replaced by the rate that is implicit in Bayesian inference
and Bayesian measures of evidence.
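
The “automatic Occam’s razor” can be illustrated with a toy example of our own choosing: a coin is tossed $n$ times and lands heads $k$ times. The simple model $H_0$ fixes the probability of heads at 0.5, whereas the complex model $H_1$ leaves it free and, by assumption, places a uniform prior on it. Under that prior, the marginal likelihood of $H_1$ is $1/(n+1)$ for every $k$, so the Bayes factor has a simple closed form. The function name is hypothetical.

```python
from math import comb

def bayes_factor_01(k, n):
    """Bayes factor p(E|H0) / p(E|H1) for k heads in n tosses, where H0 fixes
    the chance of heads at 0.5 and H1 puts a uniform prior on it (assumption)."""
    likelihood_h0 = comb(n, k) * 0.5**n
    marginal_h1 = 1.0 / (n + 1)   # binomial likelihood averaged over the prior
    return likelihood_h0 / marginal_h1

print(bayes_factor_01(11, 20))  # data close to 50% heads
print(bayes_factor_01(17, 20))  # data far from 50% heads
```

With 11 heads in 20 tosses, the best-fitting hypothesis in $H_1$ (chance of heads 0.55) fits the data better than $H_0$, and yet the Bayes factor favors $H_0$: the complex model is penalized for the many poorly fitting hypotheses it also contains. With 17 heads, the Bayes factor clearly favors $H_1$.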

However, this orthodox Bayesian model selection is not that frequently
put into practice. First of all, there is a plethora of practical and
methodological problems, such as the computational costs of calculating
posterior distributions or handling nested models in a Bayesian framework.
Second, when prior probabilities are assigned, reliable expert opinion is 
usually hard to elicit so that the choice of the prior is often dominated by 
mathematical convenience. Furthermore, results may be highly sensitive 
to the prior distribution. This has triggered the search for model selec- 
tion criteria that can play a useful role for approximating Bayes factors. 
Schwarz’s Bayesian Information Criterion (BIC) has often been claimed to 
fulfil that role; so we will investigate its foundations in some detail. We 
claim that the findings in our analysis of BIC are somewhat typical of 
Bayesian model selection in general. For a philosophical analysis of fur- 
ther Bayesian model selection techniques, such as the Minimum Message 
Length Criterion (MML), see Dowe et al. (2007) and Sprenger (2013c). 

The BIC is an estimation procedure that aims at the posterior probability
of a parametric model $M_\theta$, that is, at the weighted sum of the posterior
probabilities of the hypotheses in $M_\theta$ that correspond to different values
of $\theta$. Thus, it has a different target than AIC, which compares the
best-performing representatives of a class of models. We will now reconstruct
and analyze the motivation of BIC, following Schwarz (1978).

Assume that $M_\theta$ is one of our candidate models, whose elements are
indexed by a vector $\theta$ with dimension $K$. We would like to approximate
the posterior probability of $M_\theta$. Assume further that all probability
densities for data $x$ (with respect to the Lebesgue measure $\mu$) belong to the
exponential family and that they can be written as


$$ p(x|\theta) = e^{N \left( A(x) - \lambda |\theta - \hat\theta(x)|^2 \right)}. \tag{10.5} $$


Here, $\hat\theta(x)$ denotes the maximum likelihood estimate of the unknown $\theta$,
and $N$ the sample size, assuming independent sampling. This specific
form of the likelihood function seems to make a substantial presumption,
but in fact, the densities in (10.5) comprise the most familiar
distributions, such as the Normal, Uniform, Fisher, Poisson and Student’s
t-distribution. For that reason, the assumption is plausible from a
practical point of view.

Then we take a standard Bayesian approach and write the posterior
probability of $M_\theta$ as proportional to the prior probability $p(M_\theta)$ and the
averaged likelihood of the data $x$ under $M_\theta$:


$$ p(M_\theta|x) \propto p(M_\theta) \int_{\theta \in \Theta} e^{N \left( A(x) - \lambda |\theta - \hat\theta(x)|^2 \right)} \, d\mu(\theta) = p(M_\theta) \, e^{N A(x)} \int_{\theta \in \Theta} e^{-N \lambda |\theta - \hat\theta(x)|^2} \, d\mu(\theta). $$


Substituting the integration variable $\theta$ by $\theta / \sqrt{N\lambda}$, and realizing that for
the maximum likelihood estimate $\hat\theta(x)$, $p(x|\hat\theta(x)) = e^{N A(x)}$, we obtain


$$
\begin{aligned}
\log p(M_\theta|x) &= \log p(M_\theta) + N A(x) + \log \left( \frac{1}{N\lambda} \right)^{K/2} + \log \int_{\theta \in \Theta} e^{-|\theta - \hat\theta(x)|^2} \, d\mu(\theta) + \text{const.} \\
&= \log p(M_\theta) + N A(x) + \frac{1}{2} K \log \left( \frac{1}{N\lambda} \right) + \log \sqrt{\pi}^{\,K} + \text{const.} \\
&= \log p(M_\theta) + \log p(x|\hat\theta(x)) - \frac{1}{2} K \log \left( \frac{N\lambda}{\pi} \right) + \text{const.}
\end{aligned}
\tag{10.6}
$$


Let us take stock. On the left hand side, we have the log-posterior
probability, which can be interpreted as a subjective Bayesian’s natural model
selection criterion. As we see from (10.6), this term is equal, up to an
additive constant that does not depend on the model, to the sum of three
terms: the log-prior probability, the log-likelihood of the data under
the maximum likelihood estimate, and a penalty proportional to the
number of model parameters. This derivation, whose assumptions are
relaxed subsequently in order to yield more general results, forms the
mathematical core of BIC. The number of parameters $K$ enters the calculations
because the expected likelihood of the data depends on the dimension of
the model, via the skewness of the likelihood function.

In practice, it is difficult to elicit sensible subjective prior probabilities
of the candidate models, and the computation of posterior probabilities
involves high computational effort. Therefore, Schwarz suggests estimating
the log-posterior probability in (10.6) by a large sample approximation.
We neglect the terms that make only constant contributions and focus on
the terms that increase in $N$: $\log p(M_\theta)$ drops out of the picture.
Therefore, in the long run, the model with the highest posterior probability will
be the model that minimizes


$$ \mathrm{BIC}(M_\theta, x) = -2 \log p(x|\hat\theta(x)) + K \log N. \tag{10.7} $$
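
To see how the penalty in (10.7) compares with the one in (10.4), the two criteria can be put side by side. The sketch below uses AIC in the form of Equation (10.4) and BIC in the form of Equation (10.7); the log-likelihood values are invented for illustration, and the function names are our own. Note that the two criteria live on different scales: BIC divided by 2 is on the same log-likelihood scale as AIC.

```python
import math

def aic(max_log_lik, k):
    """AIC as in Equation (10.4): -log-likelihood + K."""
    return -max_log_lik + k

def bic(max_log_lik, k, n):
    """BIC as in Equation (10.7): -2 log-likelihood + K log N."""
    return -2.0 * max_log_lik + k * math.log(n)

def penalty_per_parameter(n):
    """Penalty for one extra parameter on the log-likelihood scale,
    returned as the pair (AIC, BIC / 2)."""
    return 1.0, 0.5 * math.log(n)

# Hypothetical numbers: an extra parameter raises the maximized
# log-likelihood from -50.0 to -48.5 in a sample of size N = 100.
n = 100
simple = {"log_lik": -50.0, "k": 2}
complex_ = {"log_lik": -48.5, "k": 3}

print("AIC:", aic(simple["log_lik"], simple["k"]),
      aic(complex_["log_lik"], complex_["k"]))
print("BIC:", bic(simple["log_lik"], simple["k"], n),
      bic(complex_["log_lik"], complex_["k"], n))
print("penalties per parameter:", penalty_per_parameter(n))
```

With these invented numbers, AIC prefers the more complex model, whereas BIC prefers the simpler one. On the log-likelihood scale, BIC penalizes an extra parameter more heavily than AIC as soon as $N$ exceeds $e^2 \approx 7.4$.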


BIC is intended to select the model that accumulates, in the long run, the
most posterior mass. However, it neglects the contribution of the priors
when comparing the models to each other. Keeping in mind the identity


$$ \log p(H|E) = \log p(H) + \log \left( p(E|H) \cdot \frac{1}{p(E)} \right) \tag{10.8} $$


and comparing it to Equation 10.6, we see that BIC could as well be
described as an approximation to the log-ratio measure of confirmation
$\log p(H|E) - \log p(H) = \log \left( p(H|E) / p(H) \right)$, up to addition of a constant
(see Variation 2).


Therefore, BIC can only partially be described as having a properly 
Bayesian justification: while (log-ratio) confirmation may be suitable for 
comparing models on the basis of past performance, it does not conform 
to classical subjective Bayesian inference: the priors drop out of the picture, 
as witnessed by the transition from (10.6) to (10.7). Instead of conforming 
to the subjective Bayesian rationale, the BIC is a hybrid procedure: it does 
not primarily aim at an accurate representation of subjective uncertainty. 
Rather, it uses the Bayesian calculus as a convenient mathematical tool for 
meeting goals that a statistician or modeler may encounter in inference, 
such as selecting models with strong performance on past data. There 
is nothing specifically Bayesian about the estimation target of BIC. This 
finding is, by the way, in agreement with Schwarz’ note that BIC extends 
“beyond the Bayesian context” (Schwarz, 1978, 461). See also Forster and 
Sober (1994, 23-24). Even more, frequentist properties are sometimes in- 
voked in an attempt to justify the practical use of BIC (e.g., Burnham and 
Anderson, 2002). 


To further strengthen this conclusion, note that BIC is quite different 
from a numerical large sample approximation for posterior degrees of be- 
lief: the posterior approximated by BIC is detached from subjective prior 
probability. So BIC is not just a practical approximation to Bayesian
coherence. Compare BIC to techniques such as Gibbs sampling or other
Markov chain Monte Carlo methods (Han and Carlin, 2001): those techniques aim at
numerical approximations of subjective posterior distributions, and offer 
computational help for tricky multi-dimensional integrals. BIC follows a 
philosophical rationale that is much less tied to Bayesian reasoning. 


Neither does the statistical consistency of BIC provide a genuinely 
Bayesian justification. Here, consistency does not denote logical consis- 
tency with another proposition, but a long-run property of statistical esti- 
mators. An estimator is consistent if and only if it converges in probability 
to the true model as sample size increases. Both Bayesians and frequen- 
tists regard consistency as a necessary constraint on good estimators, and 
BIC is consistent as long as the overall model is not misspecified (i.e., if 
it contains the true model). Apart from the fact that this condition may 
often fail in practice, consistency alone has no implications for the speed of
convergence to the true value. Hence, it is not a sufficient reason for
using a particular method. So neither is consistency in any way peculiar
to Bayesian inference, nor is it strong enough to make a case for BIC as 
opposed to other methods. 

Our diagnosis that BIC lacks, in spite of the extensive use of Bayesian 
formalism, a fully Bayesian rationale, is supported by the variety of pur- 
poses to which BIC is put. Sometimes it is regarded as an approximation 
to the Bayes factor (Kass and Raftery, 1995). Raftery (1995) proposes an 
interpretation of BIC as an approximation to marginal likelihood, which is 
easily derived on the basis of the above calculations. Romeijn et al. (2012) 
see different worries with a Bayesian understanding of BIC and propose 
to anchor it more securely in Bayesian reasoning by taking into account 
the size of the parameter space. Hence, what the asymptotic analysis of 
BIC approximates is not determined by the mathematics only: it depends 
on the general perspective one adopts. 

Neither is the derivation of BIC committed to the true model being
in the set of candidate models, which is a standard premise of Bayesian
convergence-to-truth theorems (Blackwell and Dubins, 1962; Gaifman and
Snir, 1982). All this shows that for BIC, Bayesianism constitutes no
philosophical underpinning (e.g., as a logic of belief revision), but only a
convenient framework which motivates using a specific estimator of
log-posterior probability.


Discussion 


Simplicity is a complex and ambiguous concept. This ambiguity may ex- 
plain why people are so divided over whether or not it does have cognitive 
value in science. From a Bayesian point of view, it is most fruitful to in- 
vestigate the concept of simplicity as elegance, that is, as referring to the 
number and the complexity of hypotheses in a scientific theory. The onto- 
logical dimension of simplicity (“parsimony”) is left out of the picture. 
The cognitive value of simplicity as elegance has been investigated in 
the framework of model selection and curve-fitting, that is, in fitting a 
scientific hypothesis (=a particular curve) to a set of data points. In this 
context, simplicity has epistemic value as an antidote to estimation error,
which emerges from the simultaneous estimation of several parameters.
However, this qualitative finding does not answer the question of whether there
is an optimal tradeoff rate between simplicity and other cognitive values
in model selection, most notably goodness-of-fit.

In answering this quantitative question, we have reviewed two model 
selection criteria: AIC and BIC. First, we have taken issue with Forster 
and Sober’s claim that an optimal tradeoff rate between simplicity and 
goodness-of-fit is established by the non-Bayesian Akaike’s Information 
Criterion AIC. Then we moved on to the Bayesian Information Criterion 
BIC. However, while the derivation of BIC is couched in Bayesian language,
properly Bayesian elements, such as the prior probability of different
models, are swept under the carpet. This casts doubt on whether BIC is, as its
name promises, a fully Bayesian model selection criterion.

The finding from these case studies is that most model selection cri- 
teria do not emerge from rigorous derivations, backed by a particular 
philosophical approach (e.g., frequentism, Bayesianism or likelihoodism). 
These approaches provide the conceptual and mathematical framework 
for deriving a model selection criterion (such as AIC or BIC), but no
watertight justifications. Even more, it is also possible to find a non-Bayesian
justification for BIC, and a Bayesian justification for AIC (Romeijn et al.,
2012). So it is misleading to attach model selection criteria to particular
philosophical schools. The judgment on when these criteria do and do not
work is highly context-sensitive. The Bayesianism involved in the derivation
of BIC might be characterized as an instrumental Bayesianism: an
approach to statistical inference which is happy to use Bayes’ Theorem as
a scientific modeling tool, but without taking the Bayesian elements too
literally, as expressions of subjective uncertainty.

On the positive side, our findings imply that we can reject some of the 
bolder anti-Bayesian conclusions drawn from the investigation of model 
selection criteria. For instance, Forster and Sober (1994) write: 


Bayesianism is unable to capture the proper significance of con- 
sidering families of curves [...]. Akaike’s reconceptualization 
of statistics does recommend that the foundations of Bayesian 
statistics require rethinking. (Forster and Sober, 1994, 26, orig- 
inal emphasis) 


In the light of the above results, we can safely conclude that this conclu- 
sion is mistaken. First, Bayesian and non-Bayesian model selection criteria 
stand on equal footing. Second, the link between particular model se- 
lection criteria and philosophical schools is not particularly tight. Third, 
Bayesianism considers, in calculating Bayes factors and approximations to 
them (such as the BIC), the significance of families of curves (models) as 
opposed to single curves (fitted models). It may also be noted that Forster 
and Sober do not repeat this claim in later publications, probably realizing 
that their conclusion went too far. 


I. J. Good (1971) once quipped that there are at least 46656 varieties 
of Bayesians. After studying Bayesian model selection, we may add that 
there are also many different ways of doing Bayesian model selection— 
as long as one conceives Bayesian reasoning in a more general way than 
just an inference from prior to posterior probability. In model selection, 
Bayesian reasoning is often applied in an instrumental manner. While 
these Bayesian methods are adequate frameworks for investigating our 
core question about why simplicity matters in model selection, they fail, 
like their non-Bayesian counterparts, to state an optimal tradeoff rate for 
simplicity and goodness-of-fit. As our examples and analyses have shown, 
that latter question may just depend too much on the specific context to
allow for an illuminating general answer.
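
The lack of a single optimal tradeoff rate can be illustrated with a minimal Python sketch. It assumes the common scaling AIC = -2 log L + 2K and BIC = -2 log L + K log N (lower scores are better), which may differ by sign and constant factors from the form used in the derivation sketch at the end of this variation. The two models, their log-likelihoods and the sample size are hypothetical numbers, chosen only so that the criteria disagree.

```python
import math

def aic(log_lik, k):
    """Akaike score in the common -2 log L scaling (assumed convention)."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion in the same scaling."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fitted models: (maximized log-likelihood, number of parameters K)
models = {"simple": (-100.0, 2), "complex": (-96.5, 4)}
n_obs = 200  # hypothetical sample size N

aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print("AIC prefers:", best_by_aic)
print("BIC prefers:", best_by_bic)
```

With these numbers AIC prefers the more complex model and BIC the simpler one, because BIC's penalty per parameter grows with log N while AIC's stays fixed.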


This observation also leads to several avenues for further research. 
First, if foundational philosophical arguments fail to ground model selec- 
tion criteria, what are the context-sensitive considerations that lead us to 
prefer one of the model selection criteria over others? What are the reasons 
why we decide to work with AIC instead of BIC, or vice versa? Second, 
AIC and BIC are only two of a large number of model selection criteria, 
and it would be good to extend the analysis in order to see whether the 
conclusions of this variation remain valid. One of us (Sprenger, 2013c) 
has conducted such an analysis with respect to the Minimum Message 
Length (MML, Dowe et al., 2007) and the Deviance Information Criterion 
(DIC, Spiegelhalter et al., 2002), but a more detailed study, that also in- 
volves further model selection criteria, would be highly welcome. Third, 
it would be exciting to see whether our thesis of instrumental Bayesian- 
ism (and instrumental frequentism, for the case of AIC) also extends to 
other cases of scientific inference. That is, can we find cases of scientific 
reasoning where schools of uncertain reasoning are treated as a quarry 
for mathematical formalisms rather than as a philosophical basis for jus- 
tifying one’s inference? Fourth, one may compare the role of simplicity 
in model selection to its role in other forms of scientific reasoning, e.g.,
causal inference, and check whether the results remain the same.




The last two variations have demonstrated the fruitfulness of Bayesian 
reasoning in statistical inference. The final variation will respond to a 
foundational objection that is frequently raised against the (subjective) 
Bayesian: that Bayesian inference is not objective enough to be of any
value in science.


Sketch of the Derivation of Akaike’s Information Criterion


At the end of this variation, we summarize the main steps of the derivation
of AIC, with a focus on the philosophical rationale that motivates
this model selection criterion. Detailed treatments can be found in Chapter 
4.3 of Sakamoto et al. (1986) and Chapter 7.2 in Burnham and Anderson 
(2002). 

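The AIC aims at estimating the “expected predictive success” of a
model, identified with its maximum likelihood estimate (MLE) $g_{\hat\theta(y)}$:

\[
E_x E_y \left[ \log \frac{f(x)}{g_{\hat\theta(y)}(x)} \right] = E_x E_y \left[ \log f(x) \right] - E_x E_y \left[ \log g_{\hat\theta(y)}(x) \right] \tag{10.9}
\]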


The first term on the right hand side of (10.9) is equal for all candidate 
models. When comparing them, it drops out as a constant. Hence we can 
neglect it in the remainder and focus on the second term in (10.9). 

The AIC is usually derived by a double Taylor expansion of the log-likelihood
function. The general formula of the Taylor expansion for an analytic,
real-valued function $f$ is

\[
f(x) = \sum_{k=0}^{\infty} \frac{1}{k!} f^{(k)}(x_0) \, (x - x_0)^k .
\]

In our case, we expand the term $\log g_{\hat\theta(y)}(x)$, our MLE, around $\theta_0$, the
value of $\theta$ that minimizes Kullback-Leibler divergence to the true model.
The expansion is truncated at $k = 2$, yielding
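
\[
\log g_{\hat\theta(y)}(x) \approx \log g_{\theta_0}(x) + N \left[ \frac{\partial}{\partial\theta} \log g_\theta(x) \right](\theta_0) \, (\hat\theta(y) - \theta_0) + \frac{1}{2} N (\hat\theta(y) - \theta_0)^T \left[ \frac{\partial^2}{\partial\theta^2} \log g_\theta(x) \right](\theta_0) \, (\hat\theta(y) - \theta_0) \tag{10.10}
\]

The matrix

\[
J := - \left[ \frac{\partial^2}{\partial\theta^2} \log g_\theta(x) \right](\theta_0)
\]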


that also occurs in (10.10) is called the Fisher information matrix of the data. 
It plays a crucial role in an asymptotic approximation of the maximum 
likelihood estimator that holds under plausible regularity conditions: 


\[
\sqrt{N} \, (\hat\theta(y) - \theta_0) \to \mathcal{N}(0, J^{-1}).
\]


This asymptotic normality of the maximum likelihood estimator can be 
used to simplify (10.10). The term 


\[
\sqrt{N} \, (\hat\theta(y) - \theta_0)^T \, J \, \sqrt{N} \, (\hat\theta(y) - \theta_0) \tag{10.11}
\]


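is asymptotically $\chi^2$-distributed with $K$ degrees of freedom. Hence, the
expectation of (10.11) is $K$. By taking a double expectation over $x$ and $y$,
we thus obtain that

\[
E_x E_y \left[ \frac{1}{2} N (\hat\theta(y) - \theta_0)^T \left[ \frac{\partial^2}{\partial\theta^2} \log g_\theta(x) \right](\theta_0) \, (\hat\theta(y) - \theta_0) \right] = -\frac{K}{2}. \tag{10.12}
\]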


Moreover, the linear term in (10.10) vanishes because the maximum like- 
lihood estimate is an extremal point of the log-likelihood function. Thus, 
the mean of the first derivative is also zero: 


\[
E_x E_y \left[ N \left[ \frac{\partial}{\partial\theta} \log g_\theta(x) \right](\theta_0) \, (\hat\theta(y) - \theta_0) \right] = 0. \tag{10.13}
\]


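Combining (10.10) with (10.12) and (10.13), we obtain for large samples
that

\[
E_x E_y \left[ \log g_{\hat\theta(y)}(x) \right] \approx E_x \left[ \log g_{\theta_0}(x) \right] - \frac{K}{2}. \tag{10.14}
\]

Repeating the Taylor expansion around the maximum-likelihood estimate
and applying the same arguments once more gives us

\[
E_x E_y \left[ \log g_{\theta_0}(x) \right] \approx E_y \left[ \log g_{\hat\theta(y)}(y) \right] - \frac{K}{2}. \tag{10.15}
\]

Finally, by combining (10.14) and (10.15) we obtain AIC as an estimate of
“expected predictive accuracy”:

\[
E_x E_y \left[ \log g_{\hat\theta(y)}(x) \right] \approx E_y \left[ \log g_{\hat\theta(y)}(y) \right] - K.
\]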
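
The final approximation says that the in-sample log-likelihood of the fitted model overstates its log-likelihood on fresh data by about K on average. The following Python sketch checks this by simulation for one simple case. The setup is an assumption made for illustration and is not taken from the derivation above: K groups of Gaussian observations with known unit variance, one unknown mean per group, so that the MLE is the vector of group means.

```python
import math
import random

def gaussian_loglik(data, means):
    """Log-likelihood of data under independent N(mean_i, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (d - m) ** 2
               for d, m in zip(data, means))

def mean_optimism(n_per_group=10, k=4, reps=5000, seed=1):
    """Average gap between in-sample and out-of-sample log-likelihood."""
    rng = random.Random(seed)
    true_means = [rng.uniform(-2, 2) for _ in range(k)]  # hypothetical truth
    total_gap = 0.0
    for _ in range(reps):
        # y is the sample used for fitting, x an independent replicate.
        y = [[rng.gauss(mu, 1) for _ in range(n_per_group)] for mu in true_means]
        x = [[rng.gauss(mu, 1) for _ in range(n_per_group)] for mu in true_means]
        mle = [sum(group) / n_per_group for group in y]   # K fitted parameters
        fitted = [m for m in mle for _ in range(n_per_group)]
        flat_y = [v for group in y for v in group]
        flat_x = [v for group in x for v in group]
        total_gap += gaussian_loglik(flat_y, fitted) - gaussian_loglik(flat_x, fitted)
    return total_gap / reps

gap = mean_optimism()
print("average optimism:", round(gap, 2), "(number of parameters K = 4)")
```

In this particular model the expected gap equals K exactly, so the simulated average should lie close to 4. For other model families the result holds only asymptotically and under the regularity conditions mentioned above.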
Variation 11: Scientific Objectivity 


Scientific objectivity pertains in the first place to scientific method. It expresses
the idea that the claims, methods and results of science are not, or
should not be, influenced by particular perspectives, value commitments,
community bias or personal interests, to name a few relevant factors (for 
a survey, see Reiss and Sprenger, 2014). Objectivity contributes to the reli- 
ability of scientific research, conveys an image of epistemic authority and 
strengthens our trust in science. The 2009 “Climategate” affair, when cli- 
mate scientists were charged with presenting data in a misleading way, 
and the widespread failure to replicate experimental results in psychology 
(Galak et al., 2012; Open Science Collaboration, 2015) illustrate how an 
apparent lack of objectivity weakens trust in scientific findings. 

The ideal of objectivity has been criticized repeatedly in philosophy of 
science, questioning both its value and its attainability (e.g., Feyerabend, 
1975; Kuhn, 1977b). This variation does not aim at defending it. Rather, 
we start from the assumption that some degree of objectivity is beneficial 
in scientific reasoning. Then we discuss what kind of objectivity can be 
delivered by, and is compatible with, Bayesian inference. This discussion 
will be focused on Bayesian statistics, which we see as the primary ap- 
plication of the subjective probability calculus in science. We investigate 
three challenges to the objectivity of Bayesian inference, and for each of
them, we develop proposals for how they can be rebutted, and how Bayesian
reasoning in science can be defended.

This variation is divided into five sections. Section 11.1 presents some 
background on the role of objectivity in science. Sections 11.2, 11.3 and 11.4 
describe three major challenges to Bayesian inference. In our replies, we 
blend arguments from statistical methodology with recent results in the 
philosophical analysis of scientific objectivity (e.g., Douglas, 2004, 2011). 
The final Section 11.5 places our arguments into a broader perspective 
and concludes. 
Forms of Scientific Objectivity 


There are two principled ways of understanding scientific objectivity: we 
can relate it to the products of science (e.g., theories, models, laws) and 
their relation to the world, and we can regard it as a property of the pro- 
cess of scientific reasoning. In line with the approach of this book, we focus 
on process objectivity instead of product objectivity (Reiss and Sprenger, 
2014): we are interested in the objectivity of the procedures that lead to 
the acceptance of scientific theories, not in whether the theories them- 
selves provide a faithful image of reality. First of all, that latter question 
stands orthogonal to our discussion of Bayesian inference in science. Sec- 
ond, the traditional idea of objectivity as correspondence between theory 
and world has been thoroughly debunked: The historical case studies of 
Porter (1996) and Daston and Galison (2007), the systematic work on the 
theory-observation interaction by Thomas Kuhn (1962, 1977b) and Paul 
Feyerabend (1962, 1975), and finally, critiques from feminist philosophy of 
science and standpoint epistemology (e.g., Longino, 1990; Harding, 1991; 
Lloyd, 2005; Okruhlik, 2005) have led to the conclusion that the view of 
scientific objectivity as faithfulness to facts is problematic. If objectivity 
is supposed to be a meaningful cognitive value, it needs to be associated
with particular ways of scientific reasoning.


Most analyses of objectivity in scientific reasoning proceed in one of the 
two following ways: (1) scientific reasoning is objective to the extent that 
it is free of non-cognitive (moral, social and political) values; (2) scientific 
reasoning is objective to the extent that personal biases are absent, or that 
they can be eliminated in a social process. These two forms of objectivity 
are related, e.g., personal bias is often expressed by endorsement of a 
particular non-cognitive value, such as a political ideology. To simplify the 
analysis, we will not draw a sharp distinction between them. Nevertheless, 
we would like to stress that these forms of objectivity cannot be reduced 
to each other: not all individual biases correspond to a particular non- 
cognitive value; not all non-cognitive values correspond to an individual
bias.


Moreover, it is helpful to distinguish four stages at which non-cognitive 
values may affect science. They are: (i) the choice of a scientific research 
problem; (ii) the gathering of evidence in relation to the problem; (iii) 
the assessment and acceptance of a scientific hypothesis or theory as an 
adequate answer to the problem on the basis of the evidence; (iv) the 
proliferation and application of scientific research results (Weber, 1904, 
1917). 

Most philosophers of science would agree that the role of values and 
bias in science is contentious only with respect to stages (ii) and (iii): 
the gathering of evidence and the assessment of scientific theories. It 
is almost universally accepted that the choice of a research problem is 
often influenced by the interests of individual scientists, funding parties, 
and society as a whole. This influence may make science more shallow 
and slow down its long-run progress, but it has benefits, too: scientists 
will focus on providing solutions to those intellectual problems that are 
considered urgent by society and they may actually improve people’s lives. 
Similarly, the proliferation and application of scientific research results is 
evidently affected by the personal values of journal editors and end users. 
The real debate is about whether or not the ‘core’ of scientific reasoning,
that is, the gathering of evidence and the assessment and acceptance of scientific
theories, is, and should be, free of values and bias. And since Bayesian
inference provides a theory of scientific inference rather than a theory
of designing and conducting experiments, we will focus on stage (iii),
scientific theory choice, in particular.

We can now define objectivity in scientific inference according to the
two conceptions mentioned above:


Value-Free Ideal (VFI): Scientists should strive to minimize
the influence of non-cognitive values on scientific reasoning in
gathering evidence and assessing and accepting scientific theories.


It is mainly this ideal against which Bayesian inference and its main 
competitor in scientific reasoning, frequentist inference, will be assessed. 
Other relevant forms of objectivity, such as intersubjective agreement on 
which conclusions to draw from a body of data, are often related to the 
VFI. For instance, differences in the inferences that scientists draw may 
be traced to the impact of non-cognitive values on the reasoning of the 
individual scientists. 

Apparently, there is a straightforward clash between the VFI and sub- 
jective Bayesian inference: “a notion of probability as personalistic de- 
gree of belief [...], by its very nature, is not focused on the extraction and 
presentation of evidence of a public and objective kind” (Cox and Mayo, 
2010, 298). This view is echoed in writings of well-known statisticians 
and philosophers of science such as Fisher (1956), Mayo (1996), Popper 
(2002) and Senn (2011). By and large, the objectivity-related criticisms of 
subjective Bayesian inference come in three different forms: first, the lack
of constraints on prior probabilities; second, the entanglement of statistical
evidence and degree of belief; third, the apparent blindness to bias in
experimental design. In the light of these challenges, one is tempted to
conclude that Bayesian inference cannot produce objective knowledge, is
not suitable for scientific communication and is therefore inferior to frequentist
inference. Let us now review these challenges in detail.


Challenge 1: The Choice of the Prior Distribution 


As explained in the introduction of this book, the core of Bayesian infer- 
ence consists in representing degrees of belief by probabilities, in changing 
them by means of Bayesian Conditionalization, and in basing decisions on 
posterior probabilities. In particular, the posterior degree of belief in a 
hypothesis H upon learning evidence E can be written as follows: 


\[
p(H \mid E) = \frac{p(H) \, p(E \mid H)}{p(E)} \tag{11.1}
\]

where $p(E) = \sum_{H \in \mathcal{H}} p(E \mid H) \, p(H)$ is the marginal probability of data $E$.
On the basis of the posterior probability p(H|E), a Bayesian can form a 
theoretical judgment about H or make a practical decision. For example, 
if H is the hypothesis that a new medical drug is not more efficacious than 
a placebo, and if H is sufficiently probable given the data, then we will not 
pursue further development of the drug. 
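
Equation (11.1) can be sketched in a few lines of Python. The numbers below are hypothetical and serve only to show how the same evidence yields different posteriors for H (“the drug is no more efficacious than placebo”) under two different priors; the function name `posterior` is likewise an illustrative choice.

```python
def posterior(priors, likelihoods):
    """Return p(H_i | E) for mutually exclusive, jointly exhaustive hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    marginal = sum(joint)  # p(E), the denominator of (11.1)
    return [j / marginal for j in joint]

# Hypothetical likelihoods of the trial data: p(E | H) and p(E | not-H)
likelihoods = [0.2, 0.6]

# An agnostic prior and a prior that strongly favors H
post_agnostic = posterior([0.5, 0.5], likelihoods)
post_skeptical = posterior([0.9, 0.1], likelihoods)

print("p(H | E), agnostic prior: ", post_agnostic[0])
print("p(H | E), skeptical prior:", post_skeptical[0])
```

Under the agnostic prior the posterior probability of H is 0.25, under the skeptical prior it is 0.75: a decision rule based on a fixed posterior threshold could thus come out differently for the two agents.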

Posterior probability depends on prior probability, and often, there is 
not sufficient background knowledge to establish consensus on the latter. 
Subjective Bayesians such as de Finetti (1972) have stressed that in prin- 
ciple, any coherent prior probability distribution can be defended as 
rational. This attitude seems to jeopardize any claims to objectivity that 
subjective Bayesians could possibly make. What kind of epistemic warrant 
does a Bayesian inference still provide? After all, the choice of the prior 
can hide all kinds of pernicious values, e.g., financial interests of the experiment
sponsor. This is particularly worrying in sensitive areas such as
medicine, where the need for impartial inference methods is particularly
high, due to the manifest financial interests in clinical trials and the ethical
consequences of wrong decisions. As the medical methodologist Lemuel 
Moyé writes: 


Without specific safeguards, use of Bayesian procedures will 
set the stage for the entry of non-fact-based information that, 
unable to make it through the “evidence-based” front door, 
will sneak in through the back door of “prior distributions”. 
There, it will wield its influence, perhaps wreaking havoc on 
the research’s interpretation. (Moyé, 2008, 476) 


In other words, Bayesians can bias the final result in their preferred direc- 
tion by choosing an appropriate prior. The first challenge is thus based on 
the value-free ideal that the core business of scientific reasoning, namely 
evaluating evidence, assessing and accepting theories, should be free of 
non-cognitive values and individual biases—a requirement that Bayesian 
inference seems to violate blatantly. Adherence to the value-free ideal 
has, however, in one form or another, been upheld as a trademark of sci- 
entific objectivity (e.g., Lacey, 1999; Reiss and Sprenger, 2014), and for 
practitioners, it plays an even greater role due to regulatory constraints 
and conflicts of interests. Even if one doubts that the value-free ideal can 
be attained in practice, values should not be allowed to replace scientific 
evidence (Douglas, 2009b). How can Bayesian inference be safeguarded 
against this danger? 

The first line of defense notes that subjective opinion need not be the 
same as individual bias. Two medical doctors may, on the basis of their 
experience, give a different judgment about what might be a good therapy 
for a patient with a given set of symptoms. The fact that they disagree 
does not mean that one of them or both are biased: they may have enjoyed 
a different training, come from different disciplines or have different ex- 
perience in dealing with those symptoms. Prior probability distributions 
provide a way to make explicit a judgment that is fed by individual ex- 
pertise and track record. This is also a reason why many models of expert 
judgment and decision-making use subjective Bayesian inference—even 
when objective risk assessments are required (Cooke, 1991). 

The second line of defense notes that prior probabilities are open to 
rational criticism. Whenever a prior distribution is used, be its shape con- 
ventional or peculiar, the researcher should justify her particular choice 
and explain which considerations (theoretical and empirical ones) led her
to this choice. We cannot justify an extreme posterior simply by choosing a
suitably extreme prior because it is part of the Bayesian model of reasoning
that the prior, too, needs to be justified. This is also explicit in regulations
for medical trials, such as the guidelines for the use of Bayesian statistics, 
issued by the Food and Drug Administration of the United States: 


We recommend you be prepared to clinically and statistically 
justify your choices of prior information. In addition, we rec- 
ommend that you perform sensitivity analysis to check the ro- 
bustness of your models to different choices of prior distribu- 
tions. (US Food and Drug Administration, 2010) 


The above quote hints at a second requirement in Bayesian reasoning: 
to perform a sensitivity analysis on the choice of the prior and to check 
whether the main result of the research remains intact under different 
prior assumptions. Such an analysis also contributes to scientific objectiv- 
ity in terms of “convergent objectivity” (Douglas, 2004, 2009b), according 
to which a scientific result can claim to be objective when it is validated 
from different assumptions and perspectives. Checking how a variation in 
the prior affects variation in the results therefore contributes to drawing 
conclusions which satisfy this sense of objectivity. 
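
A sensitivity analysis of this kind can be sketched as follows. The example assumes a Beta prior on a response rate and binomial data, so that the posterior mean is available in closed form; the three priors and the trial outcomes are hypothetical and not taken from any regulatory guideline.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of a rate under a Beta(a, b) prior and binomial data."""
    return (a + successes) / (a + b + trials)

# Hypothetical priors expressing different initial opinions about the rate
priors = {"uniform": (1, 1), "skeptical": (2, 8), "optimistic": (8, 2)}

def prior_spread(successes, trials):
    """Range of posterior means across the candidate priors."""
    means = {name: posterior_mean(a, b, successes, trials)
             for name, (a, b) in priors.items()}
    return means, max(means.values()) - min(means.values())

means_small, spread_small = prior_spread(14, 20)     # small hypothetical trial
means_large, spread_large = prior_spread(140, 200)   # same rate, ten times the data

print("small trial:", means_small, "spread:", round(spread_small, 3))
print("large trial:", means_large, "spread:", round(spread_large, 3))
```

If the spread across priors is small relative to the effect of interest, the result is robust in the sense of convergent objectivity; if it is large, the conclusion depends on the prior and this dependence should be reported.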

Finally, the third line of defense notes that the explicit choice of a prior 
distribution exposes modeling assumptions more clearly than competing 
paradigms. In frequentist inference, for example, such assumptions are 
more implicit and harder to identify. This makes it easier for the Bayesian 
to criticize a particular choice, contributing to scientific objectivity in the 
sense of a reasoning process that is transparently conducted and open to 
rational criticism (Longino, 1990). We will get back to this point in the 
final section. 

The bottom line is that the choice of the prior is, just like any other
modeling assumption in science, open to rational criticism. Indeed, if the
prior were not varied and judged critically, there would be no corrective 
mechanism for gauging to what extent personal bias has influenced the 
results through the choice of the prior. But the same is true of scientific 
inference in general, and of competing statistical paradigms in particular. 
Invalidating a subjective Bayesian analysis with a biased prior is as easy
or difficult as invalidating a non-Bayesian analysis with biased modeling
assumptions. Therefore, this challenge is not more fearsome for Bayesians
than for any other framework of inductive inference. We now move on to
the next challenge: that Bayesians mix up belief and evidence.


Challenge 2: Belief vs. Evidence 


The second challenge contends that scientific reasoning, and statistical 
analysis in particular, is not about assessing the degree of belief in a
hypothesis, but about finding out whether a certain effect is real or due to
chance. On this view, the Bayesian statistician commits a category mis- 
take: she tries to answer a question that scientists are not (and should not 
be) interested in, namely how plausible a hypothesis is from a subjective 
point of view. Statistical reasoning should be independent of such judg- 
ments; it is the task of science to state the objective evidence for the truth of 
the hypothesis. Ronald A. Fisher, one of the fathers of modern statistics, 
forcefully articulated this view: 


Advocates of inverse probabilities [=ascribing probabilities to 
scientific hypotheses] are forced to regard mathematical prob- 
ability, not as an objective quantity measured by observable 
frequencies, but as measuring merely psychological tenden- 
cies [=degrees of belief], theorems respecting which are useless 
for scientific purposes (Fisher, 1935, 6-7, our explanations in 
parentheses) 


Royall (1997, 4) makes a similar distinction between three major questions 
in statistical analysis: “What should we believe?”, “What should we do?” 
and “What is the evidence?”. A good answer to one of them need not 
be a good answer to another question. The Bayesian answers the belief 
question by providing prior and posterior probabilities, and the decision 
question by means of its connection to rational choice theory (e.g., Savage, 
1972), but what does a satisfactory response to the evidence question look 
like? 

Underlying this challenge is the idea of “detached objectivity” (Dou- 
glas, 2009b, 459): claims to scientific knowledge should be detached from 
personal belief and wishful thinking. Bayesians also struggle to achieve 
“concordant objectivity” (Douglas, 2004, 462-463) which is expressed in in- 
tersubjectively agreed assessments of evidence. As Quine (1992, 5) stated: 
“The requirement of intersubjectivity is what makes science objective.” 
However, the “psychological tendencies” that correspond to personal de- 
grees of belief do not fulfil this requirement. 

Many philosophers and scientists share the view that subjec- 
tive Bayesian inference falls short of achieving concordant objectivity. 
Williamson (2007) notes that “full objectivity—i.e. a single probability 
function that fits available evidence” cannot be achieved in the subjective 
Bayesian framework. Bem et al. (2011, 718) quote a Psychological Science 
referee as saying 


I have great sympathy for the Bayesian position [...] The prob- 
lem in implementing Bayesian statistics for scientific publica- 
tions, however, is that such analyses are inherently subjective, 
by definition [...] with no objectively right answer as to what 
priors are appropriate. I do not see that as useful scientifically. 


In other words, it is unclear how Bayesians can separate personal belief 
from evidential support and achieve intersubjective agreement on levels 
of evidence. 

To address this worry, we study the most popular Bayesian measures 
of evidential support in some detail. The Bayes factor (Kass and Raftery, 
1995), which we have encountered in previous variations, expresses the 
support for $H_0$ over the alternative $H_1$ in terms of the ratio of posterior
and prior odds. Equivalently, the Bayes factor can be expressed as the
ratio of the (integrated) likelihoods of $H_0$ and $H_1$:

\[
\frac{p(H_0 \mid E)}{p(H_1 \mid E)} \cdot \frac{p(H_1)}{p(H_0)} = \frac{\int_{\theta \in \Theta_0} p(E \mid \theta) \, p(\theta) \, d\theta}{\int_{\theta \in \Theta_1} p(E \mid \theta) \, p(\theta) \, d\theta}
\]

Here $p(\theta)$ denotes the prior density within the respective hypothesis.
It is important to note that the Bayes factor is not affected by $p(H_0)$ and
$p(H_1)$ simpliciter. For two point hypotheses $H_0$ and $H_1$, it is even fully
independent of the prior probability distribution: it is just the likelihood
ratio $p(E \mid H_0)/p(E \mid H_1)$, indicating how much $E$ favors $H_0$ over $H_1$. Nothing
depends on personal belief.

For composite hypotheses (e.g., $H_1 : \theta \neq 0$), things are more complicated.
The value of the Bayes factor depends on how likely the observed
evidence is under the various components of $H_0$ and $H_1$, weighted with
their relative prior probability. It is important to realize that this dependency
is benign and not pernicious in the context of null hypothesis testing.
Imagine the frequent case that we are testing the null hypothesis that
a certain intervention, e.g., taking vitamin C tablets as a cure for the common
flu, has no effect at all: $H_0 : \theta = 0$ and $H_1 : \theta \neq 0$, where $\theta$ is the
variable denoting the effect size. Of course, it is implausible that the effect
of the vitamin C intervention is exactly zero: the pill will cause some biochemical
reaction in the human body. The test does not aim at ruling out
this possibility, but at finding out whether we can use the null hypothesis
as a simple and precise, but strictly speaking wrong idealization of a complex
reality. In order to assess whether a finding is evidence for or against
$H_0$, we need to know which effect sizes are plausible and clinically relevant.
Only once this is clarified can we state meaningfully that the observed
results speak in favor of or against the null hypothesis. This conventional
methodological wisdom is mirrored in the calculation of Bayes factors.
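
The contrast between point and composite hypotheses can be made concrete with a small Python sketch. It uses a binomial model as a stand-in (an assumption for illustration, not the effect-size model of the vitamin C example): $H_0$ fixes the success probability at 0.5, while the composite $H_1$ spreads its prior over the unit interval according to a Beta(a, b) density. The data and the three priors are hypothetical.

```python
import math

def log_binom_coeff(n, s):
    return math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def point_likelihood(theta, s, n):
    """p(E | theta) for s successes in n binomial trials."""
    return math.exp(log_binom_coeff(n, s)
                    + s * math.log(theta) + (n - s) * math.log(1 - theta))

def marginal_likelihood(a, b, s, n):
    """Integrated likelihood of E under a Beta(a, b) prior on theta."""
    return math.exp(log_binom_coeff(n, s)
                    + log_beta(a + s, b + n - s) - log_beta(a, b))

def bayes_factor_01(theta0, a, b, s, n):
    """Support for the point null theta = theta0 over the composite alternative."""
    return point_likelihood(theta0, s, n) / marginal_likelihood(a, b, s, n)

s, n = 14, 20  # hypothetical data: 14 successes in 20 trials

# Two point hypotheses: a plain likelihood ratio, no prior over theta involved
lr_point = point_likelihood(0.5, s, n) / point_likelihood(0.7, s, n)

# Composite alternative: the result depends on the prior within H1
bf_uniform = bayes_factor_01(0.5, 1, 1, s, n)
bf_concentrated = bayes_factor_01(0.5, 10, 10, s, n)
bf_diffuse = bayes_factor_01(0.5, 0.5, 0.5, s, n)
print(lr_point, bf_uniform, bf_concentrated, bf_diffuse)
```

The three Bayes factors differ because each prior encodes a different judgment about which effect sizes are plausible under the alternative, which is exactly the dependency described above.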

In fact, frequentist inference with null hypothesis significance
tests (NHST) needs such plausibility judgments, too. In NHST, the null hypothesis
$H_0 : \theta = \theta_0$ of zero effect is pitched against the alternative hypothesis
$H_1 : \theta \neq \theta_0$ that there is some effect (→ Variation 9). While a type
I error corresponds to erroneous rejection of the null hypothesis, a type II
error stands for erroneous acceptance of the null, or more precisely, failure
to reject the null. Conventionally, acceptable type I error rates are set at
a level of 5%, 1% or 0.1%, dependent on the experiment. By choosing an
appropriate sample size $N$, one tries to minimize type II error. In other
words, we may be rational in following the test procedure because of its
favorable long-run properties:


[...] we shall reject $H_0$ when it is true not more, say, than once
in a hundred times, and in addition we may have evidence that
we shall reject $H_0$ sufficiently often when it is false. (Neyman
and Pearson, 1967, 142)


In other words, the relative frequency of correct decisions will clearly exceed
the relative frequency of wrong decisions. This reliance on favorable
long-run properties also explains the name frequentism. 

However, when $H_1$ is a composite hypothesis (e.g., $\theta \neq \theta_0$), the power
of an experiment has to be calculated relative to specific effect sizes. Usually,
one would choose effect sizes which correspond to theoretical expectations
and which imply a scientifically meaningful difference to the null
hypothesis. Without a judgment on which effect sizes are likely to be expected,
the choice of the sample size $N$ is tantamount to groping in the
dark. One may collect much more evidence than needed or, in the opposite
case, end up with a severely underpowered study. Therefore, the
relative plausibility of the different alternatives and the initial plausibility 
of the null hypothesis affects the design and evaluation of experiments in 
frequentist inference, too. 
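
How strongly the required sample size depends on the assumed effect size can be seen in a minimal Python sketch. It assumes the simplest textbook setting, a two-sided one-sample z-test with known unit variance and the effect expressed in standard deviation units; the function names and the chosen effect sizes are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power_two_sided_z(effect, n, alpha=0.05):
    """Probability of rejecting the null when the true standardized effect is `effect`."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5
    return (1 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)

def required_n(effect, target_power=0.8, alpha=0.05):
    """Smallest sample size reaching the target power for the assumed effect."""
    n = 1
    while power_two_sided_z(effect, n, alpha) < target_power:
        n += 1
    return n

for assumed_effect in (0.5, 0.2):
    print("assumed effect", assumed_effect, "-> N =", required_n(assumed_effect))
```

Lowering the assumed effect from 0.5 to 0.2 standard deviations raises the required sample size roughly sixfold, so the design cannot be fixed without a prior judgment about which effects are plausible.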

The same dependency is even more evident when frequentist inference 
is used for supporting practical decisions. As already argued by Rudner 
(1953), choosing and balancing type I and type II error levels involves 
non-cognitive value judgments: we implicitly reveal how severe and how 
probable we find these errors. A decision about whether or not to ac- 
cept/reject a hypothesis and to act on its basis must trade off plausibility 
with the utility of an act and the strength of the evidence. Hence, the 
view that degrees of belief must not play any role in assessing evidential 
support is taking the value-free ideal and the idea of detached objectivity 
one step too far. If applied rigorously, this would mean that we could stop 
making inferences from data to theory. Indeed, Douglas (2004, 460) also
stresses that objectivity in scientific reasoning does not imply the elimination
of personal perspective; this would actually be a gross misrepresentation
of how science works. Therefore we conclude that the second
challenge is misguided, too: scientific evidence cannot, and should not, be 
neatly separated from judgments of plausibility and degrees of belief. 


Challenge 3: Neglect of Experimental Design 


The third challenge to subjective Bayesianism concerns the problem of bias 
in trials with interim looks at the data. The problem can best be motivated 
with an example from medicine. Randomized Controlled Trials (RCTs) 
are currently the gold standard within evidence-based medicine. They are 
usually conducted as sequential trials allowing for monitoring for early 
signs of effectiveness or harm. In sequential trials, data are typically mon- 
itored as they accumulate. That is, we have interim looks at the data and 
we may decide to stop the trial before the planned sample size is reached. 
By terminating a trial when overwhelming evidence for the effectiveness 
or harmfulness of a new drug is available, the prohibitive costs of a medi- 
cal trial can be limited and in-trial patients are protected against receiving 
inferior treatment. 


Such truncated trials are often seen as problematic. In a review of 
134 trials stopped early for benefit, Montori et al. (2005) point to an inverse
correlation between sample size and treatment effect: the smaller
the sample size achieved by the trial at the moment of stopping, the larger 
the estimate it provided for the effect. These findings are supported by 
a more recent study by Bassler et al. (2010) where truncated trials report 
significantly higher effects than trials that were not stopped early. While 
the authors of these studies do not object to monitoring and truncating 
trials in general, they advocate that results (e.g., effect size estimates) from 
such trials be treated with caution. Bayesian measures of evidence such as 
the Bayes factor do not depend on the sampling protocol or experimental 
design and evaluate truncated trials like fixed-sample trials. This seems to 
introduce a bias toward overestimating effect sizes. 


Indeed, critics of Bayesian inference such as Deborah Mayo (1996) com- 
plain that decoupling statistical inference from the sampling protocol “can 
lead to a high probability of error, and [...] this high error probability is 
not reflected in the interpretation of data” (Mayo and Kruse, 2001). In the 
context of medical research, the Bayesian seems to provide carte blanche for 
implementing any design that favors the pursuit of certain non-cognitive 
values, such as the financial interests of the trial sponsor. For instance, we 
could sample on until a convincing result is reached, conduct a Bayesian 
analysis and submit the study for publication, without mentioning the bi- 
ased sampling procedure. After all, whether the data were obtained by 
means of a biased sampling protocol, an unbiased protocol or no protocol 
at all affects neither the posterior probability nor the Bayes factor. Again, 
the perceived threat to the objectivity of Bayesian inference comes from 
the hidden intrusion of bias and non-cognitive values into statistical rea- 
soning. 

Four responses can be made to this criticism. First, the phenomenon 
on which the criticism is based can also be described differently. Higher 
effect sizes in truncated trials are not surprising, but predictable (Good- 
man et al., 2010). Of all treatments, highly efficacious ones will be most 
prone to early termination for benefit. That is, when the actual effect size 
is large, it is more probable that we also observe a large effect in our 
sample and decide to terminate the trial. Hence, the observed difference 
between truncated and completed trials is precisely what we should ex- 
pect. Comparing truncated to completed trials amounts, as highlighted 
by Berry et al. (2010), to selecting the trials to be compared on the basis
of their outcome. In that light, it is questionable whether the observed
effect size difference between truncated and non-truncated trials is really 
problematic. 
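
The selection effect behind this first response can be made concrete with a small simulation. The following Python sketch is an illustration only and does not reproduce the analyses of the cited studies: the sample sizes, the stopping threshold `Z_STOP` and the mix of true effects are assumed values chosen for exposition. Each simulated trial has a single interim look and is stopped early if the interim z-score exceeds the threshold.

```python
import random
import statistics

random.seed(1)

N_INTERIM, N_FULL = 50, 100   # assumed interim and planned sample sizes
Z_STOP = 2.5                  # assumed stopping threshold for the interim z-score
TRUE_EFFECTS = [0.0, 0.5]     # assumed: half of the treatments have no effect


def run_trial(effect):
    """Simulate one trial with unit-variance outcomes and one interim look.

    Returns (stopped_early, estimated_effect).
    """
    data = [random.gauss(effect, 1.0) for _ in range(N_FULL)]
    interim_mean = statistics.fmean(data[:N_INTERIM])
    z = interim_mean * N_INTERIM ** 0.5   # z-score against "no effect"
    if z > Z_STOP:
        return True, interim_mean
    return False, statistics.fmean(data)


results = [run_trial(random.choice(TRUE_EFFECTS)) for _ in range(2000)]
truncated = [est for stopped, est in results if stopped]
completed = [est for stopped, est in results if not stopped]

mean_truncated = statistics.fmean(truncated)
mean_completed = statistics.fmean(completed)
print(f"truncated: n={len(truncated)}, mean estimate={mean_truncated:.2f}")
print(f"completed: n={len(completed)}, mean estimate={mean_completed:.2f}")
```

Under these assumptions, truncated trials report a higher average estimate than completed ones simply because effective treatments are the ones most likely to cross the stopping threshold.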


Second, prior knowledge or empirically-based prior expectations are 
highly relevant for dealing with overestimated effects. Imagine that we are 
interested in the relative risk reduction which a medical drug provides. A 
Bayesian represents her uncertainty by means of a prior probability dis- 
tribution over that quantity. By means of Bayes’ Theorem, this distribu- 
tion is updated to a posterior probability distribution that synthesizes the 
observed evidence with the background knowledge. Then, the Bayesian 
framework naturally accounts for the intuition that truncated trials should 
be treated with caution: for the same observed effect size, small sample 
sizes change the prior distribution less than large sample sizes. The poste- 
rior distribution visualizes these differences in an intuitive way that can be 
directly used for decision-making (Goodman, 2007; Nardini and Sprenger, 
2013). In other words, the subjective Bayesian has an automatic safeguard 
against rash conclusions which other inference schools do not possess. 
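
A minimal numerical sketch of this point, assuming a normal prior over the effect and normally distributed outcomes with known standard deviation. The conjugate model is chosen purely for convenience, and the prior and data values are hypothetical:

```python
def posterior(prior_mean, prior_sd, observed_mean, sigma, n):
    """Conjugate normal update for a mean with known outcome standard deviation sigma."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * observed_mean)
    return post_mean, post_var ** 0.5


# Same observed effect (0.4), skeptical prior centered on 0, two sample sizes.
for n in (20, 200):
    mean, sd = posterior(prior_mean=0.0, prior_sd=0.2, observed_mean=0.4, sigma=1.0, n=n)
    print(f"n={n}: posterior mean={mean:.3f}, posterior sd={sd:.3f}")
```

With these numbers, the posterior mean moves from the prior mean of 0 to about 0.18 for n = 20, but to about 0.36 for n = 200: the small trial is discounted more heavily, although the observed effect is identical.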


Third, that Bayes factors do not depend on the sampling protocol does 
not imply that Bayesians should ignore matters of experimental design. 
Procedural objectivity in the form of following certain regulatory con- 
straints and standard procedures can be helpful to eliminate certain forms 
of institutional bias (Douglas, 2004, 2009b). In fact, guidelines for the use 
of Bayesian statistics (such as the ones issued by the Food and Drug Ad- 
ministration) stress that Bayesians should be as conscious and diligent in 
matters of experimental design as frequentists. For instance, also from 
a Bayesian perspective, a test with high type I and type II errors is evi- 
dently a bad test. The point of disagreement is different: while the fre- 
quentist bases her post-experimental evaluation of the evidence on the 
pre-experimental design and the properties of the entire experiment, the 
Bayesian considers these properties as essential for obtaining valid data, 
but as orthogonal to the question of how to interpret them once they are 
in (see also Sprenger, 2009). 


Fourth, the neglect of stopping rules follows immediately from a fundamental
principle of Bayesian inference: the Likelihood Principle. According to that
principle, all experimental evidence (=judgments of evidential support) about
an unknown parameter $\theta$ is contained in the likelihood function
$L_E(\theta) = p(E|\theta)$ for observed data $E$. Formally, the principle
is stated thus:


Likelihood Principle (LP): Consider a statistical model M 
with a set of probability measures $p(\cdot|\theta)$ parametrized by
$\theta \in \Theta$. Assume we conduct an experiment $\mathcal{E}$ in M. Then,
all evidence about $\theta$ generated by $\mathcal{E}$ is contained in the
likelihood function $p(E|\theta)$, where the observed data $E$ are treated as
a constant. (Birnbaum, 1962; Berger and Wolpert, 1984) 


This principle is one of the cornerstones of Bayesian inference. As Birn- 
baum (1962) showed in a celebrated paper, it can be derived from two more 
foundational principles: the Sufficiency Principle and the Conditionality 
Principle. 

We begin with the first one. A statistic (i.e., a function of the data
$X$) $T(X)$ is sufficient if the distribution of the data $X$ does not depend on
the unknown parameter $\theta$, conditional on $T$. In other words, sufficient
statistics are compressions of the data set that do not lose any relevant
information about $\theta$. An example is an experiment about the bias of a coin.
Assuming that the tosses are independent and identically distributed, the
overall number of heads and tails is a sufficient statistic for an inference
about the bias of the coin. Thus, we can neglect the precise order in which
the results occurred. Formally, the Sufficiency Principle states that any
two observations $x_1$ and $x_2$ are evidentially equivalent with regard to the
parameter of interest $\theta$ as long as $T(x_1) = T(x_2)$ for a sufficient
statistic $T$. Therefore, the principle is usually accepted by Bayesians and
frequentists alike.
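
The coin example can be checked directly. In the sketch below (an illustration with made-up toss sequences), two sequences with the same number of heads and tails but a different order receive the same likelihood for every value of the bias:

```python
import math


def sequence_likelihood(tosses, theta):
    """Probability of an exact sequence of i.i.d. tosses ('H' or 'T') when P(heads) = theta."""
    prob = 1.0
    for toss in tosses:
        prob *= theta if toss == "H" else 1.0 - theta
    return prob


seq_a = "HHTHTHHT"   # 5 heads, 3 tails
seq_b = "TTTHHHHH"   # same counts, different order

for theta in (0.2, 0.5, 0.8):
    la = sequence_likelihood(seq_a, theta)
    lb = sequence_likelihood(seq_b, theta)
    # Equal up to floating-point rounding: only the counts matter.
    print(theta, math.isclose(la, lb))
```

Since the likelihood depends on the data only through the counts, the order of the tosses carries no further information about the bias.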

The Conditionality Principle is more controversial: it states that the
evidence gained in a probabilistic mixture of experiments is equal to the
evidence in the actually performed experiment. In other words, if we throw
a die to decide whether experiment $\mathcal{E}_1$ is conducted (in case the die
comes up with an odd number) or experiment $\mathcal{E}_2$ (even number) and we
throw a six, then the evidence from the overall experiment
$\mathcal{E} = \mathcal{E}_1 \oplus \mathcal{E}_2$ is equal to the evidence from
$\mathcal{E}_2$. Frequentists usually reject Conditionality since
their measures of evidence take the entire sample space into account. A
thorough discussion of these principles goes beyond the scope of this variation
and can be found in, e.g., Mayo (2010) or Gandenberger (2015). To
the extent that the Sufficiency and Conditionality Principles are found to be
compelling requirements for objective, truth-directed inference that is free of
non-cognitive values, the Likelihood Principle constrains our judgments of
evidential support in a way that is incompatible with frequentist inference,
e.g., p-values.

In particular, the Likelihood Principle implies the Stopping Rule Principle
(Berger and Berry, 1988, 34). Since the Likelihood Principle implies
that only information contained in the likelihood function is evidentially
relevant, and since the likelihood functions of the parameter values under
different stopping rules are proportional to each other (proof omitted),
stopping rules cannot have an evidential role. Indeed, Bayesians argue
that


The design of a sequential experiment is [...] what the exper- 
imenter actually intended to do. (Edwards et al., 1963, 239. See 
also Savage 1962, 76.) 


In other words, since such intentions are “locked up in [the experi- 
menter’s] head” (ibid.), not verifiable for others, and apparently not 
causally linked to the data-generating process, they should not matter 
for sound statistical inference. Hence the dismissal of stopping rules in 
Bayesian judgments of evidential support. 
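
The proportionality of likelihood functions under different stopping rules can be illustrated with a standard textbook case, which is not taken from the sources cited here: a coin lands heads 9 times and tails 3 times, and the experimenter either planned exactly 12 tosses or planned to toss until the third tail. The data values and the two designs are assumptions made for the sake of the example.

```python
from math import comb

HEADS, TAILS = 9, 3   # illustrative data: 12 tosses in total


def binomial_likelihood(theta):
    """Design A: toss exactly 12 times, observe 9 heads."""
    return comb(HEADS + TAILS, HEADS) * theta ** HEADS * (1 - theta) ** TAILS


def negative_binomial_likelihood(theta):
    """Design B: toss until the 3rd tail, which arrives on toss 12."""
    return comb(HEADS + TAILS - 1, TAILS - 1) * theta ** HEADS * (1 - theta) ** TAILS


# The ratio of the two likelihoods is the same constant (4, up to rounding)
# for every theta, so any prior yields the same posterior under both designs.
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in (0.3, 0.5, 0.7)]
print(ratios)

# One-sided p-values for H0: theta = 0.5 against theta > 0.5 depend on the design.
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12
p_negative_binomial = sum(comb(11, k) for k in range(0, 3)) / 2 ** 11
print(round(p_binomial, 4), round(p_negative_binomial, 4))   # 0.073 0.0327
```

The Bayesian posterior and the Bayes factor are unaffected by the choice between the two designs, whereas the p-value falls on different sides of the conventional 5% threshold.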

This position has substantial practical advantages: if trials are termi- 
nated for unforeseen reasons, e.g. because funds are exhausted or because 
unexpected side effects occur, the observed data can be interpreted properly
in a Bayesian framework, but not in a frequentist framework. The same holds
for cases where the sampling protocol cannot be retrieved (e.g., historical
records). As externally forced discontinuations of sequential trials frequently
happen in practice, claims to the evidential relevance of stopping
rules would severely compromise the proper interpretation of sequential 
trials. 

In total, the claim that Bayesian inference in sequential trials contains 
an implicit bias or invalidates scientific reasoning can be soundly rebutted. 
The particular problem of sequential analysis and monitoring ongoing tri- 
als poses no challenge to Bayesian inference that is not equally pressing 
for its competitors, such as frequentist inference. 


Discussion: A Digression on Scientific Objectivity 


The concept of scientific objectivity is a notoriously difficult one, with various
aspects and interpretations. It is a commonly shared view, though,
that objective conclusions support the epistemic authority of science, distinguishing
it from religion or political ideology. No wonder that statistical
approaches are also valued according to their ability to provide an image
of objectivity. Objectivity can manifest itself in different aspects of scientific
reasoning, e.g., in intersubjective agreement on evidence, priority of
evidence over values, freedom from idiosyncratic bias, standardization of inference
procedures, responsiveness to criticism, and so on. The standard
criticisms of Bayesian inference relate to selected aspects of the complex 
notion of scientific objectivity. We recap the main ideas below. 


First, there is the idea that subjective Bayesian inference is particularly 
vulnerable to the intrusion of bias and non-cognitive values due to its
dependence on prior probabilities, and the lack of restrictions on choosing 
them. However, prior degrees of belief can incorporate valuable expertise 
and background information and they can (and should!) be criticized like 
any statistical model assumption. Once these points are recognized, the 
challenge loses its bite. It can also be demonstrated that sensitivity analysis 
in Bayesian inference contributes to convergent objectivity in Douglas’s 
sense: validation of a result from different independent perspectives. 


Second, there is the fear that on a Bayesian approach, scientific evi- 
dence is always entangled with (possibly idiosyncratic and biased) sub- 
jective judgments of belief. Similarly, one may argue that intersub- 
jective agreement on levels of evidence—the concordant dimension of 
objectivity—is hard to achieve on a Bayesian approach. We have shown 
that these fears are not substantiated in standard Bayesian hypothesis test- 
ing. And for composite hypotheses, the Bayes factor (=the Bayesian’s stan- 
dard measure of evidence) only depends on the relative weight of the in- 
dividual hypotheses—a dependency which we have argued to be benign 
and necessary for meaningful scientific inference. Moreover, frequentist
inference exhibits the same dependency.


Third, Bayesian inference is criticized for neglecting that certain exper- 
imental designs may lead to biased effect sizes. However, this criticism 
relies on a questionable description of evidence from truncated trials and 
on a failure to recognize how observations and prior belief are amalga- 
mated in Bayesian inference. In addition, the Likelihood Principle pro- 
vides a justification for why stopping rules, and aspects of experimental 
design more generally, should not affect post-experimental judgments of 
evidential support. 


It is also worth commenting on aspects of objectivity that relate to the social
dimension of science. Helen Longino (1990) has forcefully argued that
scientific objectivity is not only about scientific reasoning itself, but also 
about the structure of scientific discourse: the possibility of openly crit- 
icizing each other’s assumptions, providing a floor for the exchange of 
rational arguments, etc. In this respect, Bayesian inference has several im- 
portant assets: it is honest and transparent about the assumptions it makes 
and clearly distinguishes between prior belief, evidence, and conclusions 
(=posterior belief). This points out clear avenues for model criticism and 
allows for a straightforward detection of inappropriate bias, such as prior 
assumptions that heavily favor a particular hypothesis. Moreover, subjec- 
tive Bayesianism provides a rigorous description of what happens when 
the prior assumptions on a parameter value are varied. The transparency 
of the role of individual degrees of belief can be seen as a plus of subjective 
Bayesianism from the vantage point of scientific objectivity. 


In the light of these arguments, claims that subjective Bayesians can- 
not quantify evidence in an objective way must be rejected as unjustified. 
They rely on too narrow and one-sided a view of scientific objectivity and
on an oversimplified picture of Bayesian inference. What is more, it has been
shown that the diversity of prior distributions that characterizes subjec- 
tive Bayesianism can also be a strength from the point of view of scientific 
objectivity. 

One caveat, though. We do not want to promote subjective Bayesian- 
ism as a one-size-fits-all solution for problems of statistical inference, and 
more generally, scientific inference. We have mentioned in the previous 
variation how difficult it can be to come up with meaningful subjective 
priors, and how computationally expensive the calculation of posteriors 
can be. Rather, we have argued that when an inference problem lends it- 
self to a Bayesian analysis, the apparent lack of objectivity, due to the use 
of subjective degrees of belief, may be misleading. For subjective Bayesian 
inference, the objectivity problem is no more and no less pressing than for
any other inference method in science.


This is not meant to deny that there are many open challenges for 
Bayesian inference with respect to its objectivity. First, the question
of meta-analysis of experiments translates, on a Bayesian reading,
into the aggregation of posterior probability distributions. How can such
a pooling procedure be immunized against the intrusion of bias (e.g., by
manipulating the prior probability distributions in the individual studies)?

Second, in the light of the discussion about the replicability of psychological
research and the reliability of statistical analysis (Ioannidis, 2005;
Makel et al., 2012; Francis, 2014; Francis et al., 2014; Open Science Collaboration,
2015), some methodologists have called for radical conclusions.
Taking issue with the NHST methodology, which they see as highly problematic,
Trafimow and Marks (2015) have banned p-values from the journal
they are editing, Basic and Applied Social Psychology
(BASP). This is not to say that they are in favor of Bayesian inference:
they see it as an alternative to frequentist inference that is viable in some
cases, but often struggles to come up with meaningful prior distributions.
Therefore, they recommend conducting statistical analysis without inferential
tools and relying exclusively on descriptive statistics (e.g., effect
sizes, correlation coefficients), which they see as more objective. Defending
Bayesian inference against this minimalist approach is an exciting task
for those who are convinced by the arguments in this variation.

Third and last, there are several varieties of “objective Bayesian inference”,
which try to find a middle ground between the two grand schools
of statistical inference. With the exception of the Principle of Maximum 
Entropy, they have not received much attention from philosophers. Future 
research should take these approaches, explained below, more seriously 
and investigate whether they build a philosophically sound and practi- 
cally viable bridge between Bayesian and frequentist inference that lives 
up to the ambitions of objectivity. 


Objective Priors The method of objective priors in Bayesian inference 
tries to get the sting out of the first challenge (arbitrariness of priors) 
by giving up the static dimension of Bayesianism—that probabilities 
represent subjective degrees of belief. Instead, this method advocates 
priors that implement a sort of Principle of Indifference between the 
hypotheses under consideration, such as assigning equal probabil- 
ity to each parameter value. The problem with this approach is 
that the underlying Principle of Indifference is philosophically shaky 
(e.g., Hájek, 2011). However, statisticians have also worked on refinements
of this approach on information-theoretic grounds (e.g.,
Jeffreys, 1961). José Bernardo (1979a,b, 2012) proposed so-called reference
priors, motivated by invariance under one-to-one transformations of
parameters of interest. See Sprenger (2012) for a discussion of their
philosophical implications.


The Principle of Maximum Entropy This approach parts with the dynamic
dimension of Bayesian inference: the use of Bayesian Conditional- 
ization as a principle of belief revision. Rather, an agent’s ratio- 
nal degrees of belief should satisfy three constraints (Jaynes, 1968; 
Williamson, 2010): they should (i) conform to the axioms of probabil- 
ity; (ii) satisfy empirically given constraints on our rational degrees 
of belief; and given these constraints, (iii) they should be equivocal, 
that is, as middling as possible. This amounts to maximizing the en- 
tropy of the probability function that represents an agent’s rational 
degree of belief. If $\omega$ denotes the atoms of the relevant $\sigma$-algebra, the
entropy is given by the term

$$H = -\sum_{\omega \in \Omega} p(\omega) \log p(\omega). \tag{11.3}$$

While the Principle of Maximum Entropy is of great help in many 
practical problems in engineering, computer science, and related 
disciplines, it is hard to find a watertight epistemic or decision-theoretic
justification for why degrees of belief should in general be
as middling as possible (Seidenfeld, 1979, 1986). 


Conditioning on Evidence Strength Of the three approaches discussed, 
this is the least known one. The idea is to give a valid Bayesian 
interpretation to frequentist error probabilities (type I and type II 
error) by appropriate conditioning on the strength of the observed 
evidence, e.g., an observed p-value (Berger et al., 1994, 1997; Berger, 
2003). It can be shown that in many cases, Bayesian and frequentist 
reasoners agree numerically in conditional inference; they just use 
different interpretations. Moreover, conditional inference is directly 
applicable to salient problems in the analysis of sequential trials in 
medicine (Nardini and Sprenger, 2013). These attempts to find a 
compromise between Bayesian and frequentist inference are, for the 
most part, still terra incognita from a philosophical point of view, 
but they strike us as original and worthy of further attention. 
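
To make the notion of equivocation in the Principle of Maximum Entropy more tangible, the following sketch evaluates the entropy of equation (11.3) for two probability functions over four atoms. The distributions are made-up examples, and the sketch covers only the simplest case in which no empirical constraints are imposed.

```python
from math import log


def entropy(p):
    """Shannon entropy H = -sum p(w) log p(w), with 0 log 0 taken as 0."""
    return -sum(q * log(q) for q in p if q > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally equivocal
skewed = [0.70, 0.10, 0.10, 0.10]    # concentrates belief on one atom

for name, dist in [("uniform", uniform), ("skewed", skewed)]:
    print(f"{name}: H = {entropy(dist):.3f}")
```

In the absence of constraints, the uniform distribution attains the maximal value log 4, which is why the principle recommends it as the most middling assignment of degrees of belief.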


The Theme Revisited 


This book is an attempt to analyze and to elucidate scientific reasoning 
by means of subjective Bayesian inference. Subjective Bayesians assign 
probabilities to scientific theories and interpret these values as personal 
degrees of belief. By the principle of Bayesian Conditionalization, or one 
of its generalizations, these degrees of belief are changed in the light of 
incoming evidence. While scientific reasoning is, of course, much broader 
and richer than what can be expressed by degrees of belief, subjective 
probability provides fruitful explications of several important concepts in 
science (such as confirmation, explanatory power, and causal effect), and 
it reconstructs several prominent argument patterns, such as the NAA and 
the NMA. The book brings together and unifies these Bayesian models in 
philosophy of science. 

More precisely, we vary the Bayesian theme $p(H|E) = p(H)\,p(E|H)/p(E)$
in three directions. The first set of variations
investigates various confirmatory arguments in science, most of which are 
not straightforwardly captured in the Bayesian framework. Variation 1 
provides a framework for learning conditionals (e.g., scientific hypotheses 
of the form “if A, then B”) in a Bayesian framework that is immune to 
the drawbacks of previous proposals. Variation 2 presents and evaluates 
different proposals for quantifying degree of confirmation. Variation 3 
deals with the Problem of Old Evidence—the problem of describing how 
learning a dependency between theory and evidence can boost confidence 
in a theory when the evidence itself is already known. In this variation, 
two novel solutions are provided. Variation 4 on the No Alternatives 
Argument shows how the failure to find alternatives to a theory can 
confirm a theory even in the absence of genuine empirical evidence. 
The argument also suggests how Inference to the Best Explanation can 
be justified within a confirmation-theoretic perspective. Variation 5 
frames the famous No Miracles Argument in favor of scientific realism
in Bayesian terms and investigates its scope and limits according to
different ways to frame and to model the argument. Taken together, these 
variations demonstrate that Bayesian confirmation theory extends beyond 
the standard case of evaluating predictions of a scientific hypothesis: it 
suits a remarkable variety of modes of scientific reasoning. 


The second set of variations focuses on causal effect, explanatory 
power, and intertheoretic coherence. While Variations 6 and 7 provide 
axiomatic characterizations of causal effect and of explanatory power, Vari- 
ation 8 demonstrates how the establishment of intertheoretic reductive re- 
lations can raise the probability of a scientific theory at the fundamental 
level. 


The third set of variations is motivated by issues in statistical infer- 
ence. Variation 9 closes a lacuna in the methodology of hypothesis testing 
by developing a probabilistic measure of corroboration—a task which is 
of utmost importance for scientific practice (e.g., for the interpretation of 
non-significant results). Variation 10 investigates the role of simplicity in 
Bayesian model selection. Finally, Variation 11 takes up various challenges 
to the objectivity of Bayesian inference and demonstrates that it is no less 
objective than its frequentist competitors. 


These final variations also demonstrate the limits of Bayesian 
modeling—for example, in Variation 10, we see that popular model se- 
lection criteria fail to have a Bayesian justification, and that they should, 
despite being known as Bayesian criteria, rather be seen as Bayesian 
heuristics. Variation 9 argues that there cannot be a purely Bayesian, 
confirmation-theoretic explication of corroboration. Bayesian and non- 
Bayesian approaches have to be combined in order to measure the degree 
to which a hypothesis has stood up to severe tests. Finally, Variation 11 
investigates scope and limits of objectivity in Bayesian reasoning. 


It is also notable that there is a high degree of methodological similar- 
ity between the different variations, despite the divergent nature of their 
explicanda. For instance, Variation 6 on causal effect transfers techniques 
from Bayesian Confirmation Theory (Variation 2) almost one-to-one. Sim- 
ilar things can be said about Variation 7 (explanatory power). Variation 
3 on the Problem of Old Evidence benefits from our improved account of 
learning conditional information in Variation 1. Finally, Variations 4 and
5 model the assessment of a scientific theory by means of including an 
additional propositional variable: the number of available alternatives. 


We do not want to convince the reader that Bayesian modeling is a 
universal method or solution to all topics and problems in philosophy of 
science. This standpoint would be exposed to manifold criticism, e.g., see 
Norton (2003, 2011) for foundational criticisms of purely formal accounts 
of inductive inference. What we hope to have demonstrated is much less 
ambitious: that Bayesian inference is more than a simple and appealing 
theory for representing and updating degrees of belief. It is home to powerful
models that can be applied to a surprising variety of problems in
scientific reasoning.


The use of Bayesian inference as a model for explicating scientific val- 
ues is also characteristic of our general methodology. Indeed, our ap- 
proach can be characterized as “scientific philosophy”—not in the sense 
of the logical empiricists or naturalists such as W.V.O. Quine (1969), but in 
an understanding that is closer to the views of Hans Reichenbach (1951) 
and Hannes Leitgeb (2013). See also Hartmann and Sprenger (2012). The 
logical empiricists (e.g. Carnap, 1935) understood scientific philosophy 
as the task of refining, improving and laying the foundations for a lan- 
guage of science. Naturalists such as Quine saw philosophy as a branch of 
science—e.g., epistemology was thought to reduce to cognitive psychol- 
ogy. Recently, Maddy (2009) and Ladyman and Ross (2009) have tried 
to revive this style of philosophical theorizing. We do believe, however, 
that the epistemic problems of science are genuinely philosophical problems 
that cannot be reduced to purely scientific questions (above all, because of 
the normative character of many questions). Like Leitgeb, we believe that 
such questions can be addressed with the use of scientific tools, that is, 
formal modeling, case studies, experimentation and computer-based simulations,
which can be fruitfully combined with conceptual analysis as
core methods of philosophical analysis.


All these methods have a history in philosophy, some longer, some 
shorter. Conceptual analysis goes back to the very beginnings of philoso- 
phy, e.g., Plato’s famous analysis of knowledge. Mathematical and logical 
analysis have a similarly rich history, going back to Aristotle’s logic and 
the Medieval logicians. Interestingly, mathematics, and probability theory 
in particular, have been used less frequently for explicating philosophi- 
cal arguments—Hume’s Dialogues on Natural Religion and the famous 
10th chapter of the Enquiry on Human Understanding (“Of Miracles”) being
among the notable exceptions. Case studies, already part of Bacon’s
Novum Organum and Descartes’ Discours de la Méthode (Part V), have
been popular in philosophy of science since the 1960s and 1970s. They an- 
swered the need for calibrating general philosophical models of scientific 
reasoning, such as those provided by the logical empiricists, with the prac- 
tice of science and have inspired philosophical theorizing ever since. For 
instance, the mechanistic model of explanation popularized by Machamer 
et al. (2000) and Craver (2007) heavily draws on case studies in cognitive 
and biological sciences. Experimental methods, by contrast, have a quite 
young history in philosophy, related to the emergence of experimental phi- 
losophy as a part of epistemology (e.g., Stich, 1988; Weinberg et al., 2001; 
Alexander and Weinberg, 2007) and to collaborations between cognitive scientists
and philosophers of science (e.g., Crupi et al., 2008, 2013; Colombo
et al., 2016a). Finally, over the last years, computational methods and 
agent-based simulations in particular have gained ground in philosophy 
of science. Often they are used to study the emergence and stability of 
social norms and contracts (e.g., Alexander, 2007; Skyrms, 2010; Muldoon 
et al., 2014), but sometimes they are also applied to modeling scientific 
progress and the communication structure of epistemic communities (e.g., 
Zollman, 2007; Weisberg and Muldoon, 2009; De Langhe and Rubbens, 
2015; Heesen, 2016a). Of particular interest are those studies where probabilistic
reasoning in science (e.g., NHST) interacts with rewards and biases
in the scientific community (e.g., Romero, 2016). Notably, all these meth- 
ods are rarely combined with each other, and doing so is perhaps one of 
the main innovations of this book. 


Indeed, most variations in this book feature a majority of these meth- 
ods. Conceptual analysis and formal modeling, the core methods of our 
explicative project, are used in almost all of the eleven variations. The
final Variation 11, which evaluates the objectivity of Bayesian reasoning, is 
perhaps the only one that explicitly eschews formal modeling. Case stud- 
ies play an important role in Variation 3 (confirmation by old evidence), 
Variation 4 (string theory as an example of the NAA), Variation 6 (measur- 
ing causal effect), Variation 8 (reduction in statistical mechanics), Variation 
9 (corroboration and null hypothesis significance testing), and Variation 10
(evaluating different model selection criteria). These variations also ad- 
dress methodological problems in specific disciplines (e.g., Variation 4 for 
particle physics, Variation 6 for cognitive psychology and medical science, 
Variations 9 and 10 for statistics). Experimental evidence from psychology
and cognitive science is cited in Variation 1 (learning conditionals),
Variation 2 (judgments of confirmation), Variation 6 (causal induction),
Variation 7 (judgments of explanatory power) and Variation 9 (scientists’
use of null hypothesis significance tests). Finally, computational methods
are used, sometimes behind the scenes, in Variation 3 (comparing our
assumptions to those proposed by Jeffrey & Co.), Variation 4 (degree of
confirmation of the NAA), Variation 5 (scientometric analysis of theoretical
developments), Variation 7 (explanatory power vs. posterior probability)
and Variation 8 (degree of confirmation for successful reductions).
Table 12.1 gives a schematic overview.

Method                                          Relevant Variations
Conceptual Analysis                             all
Formal Modeling                                 1-10
Case Studies + Addressing Discipline-Specific
  Methodological Problems                       3, 4, 6, 8, 9, 10
Experimental Evidence                           1, 2, 7, 9
Computational Methods                           3, 4, 5, 7, 8

Table 12.1: An overview of the methods used in the book.

We finish by sketching open questions for future research. For the 
reader’s convenience, we recap three open research questions from each
variation below.

Variation 1 Learning Conditional Information

• Applying different divergence measures to learning conditional
  information
• Transferring the analysis from indicative to subjunctive conditionals
• Developing a general theory for causal and evidential constraints
  on degree of belief

Variation 2 Confirmation

• Investigating information-theoretic foundations of confirmation
  measures
• Applying confirmation theory to the diagnostic value of scientific
  tests (e.g., in medicine)
• Using confirmation judgments to explain phenomena in the
  psychology of reasoning

Variation 3 The Problem of Old Evidence (POE)

• Applying the POE solutions to the prediction/accommodation
  problem
• Integrating the POE with an analysis of explanatory reasoning
  in science
• Solving the POE in terms of learning conditional information
  (see Variation 1)


Variation 4 The No Alternatives Argument (NAA)

• Relating the NAA to eliminative inference
• Applying the NAA to Inference to the Best Explanation
• Finding instances of NAA-based reasoning in diverse scientific
  fields


Variation 5 Scientific Realism and the No Miracles Argument (NMA)

• Studying parallels between the NAA and the NMA
• Conducting a scientometric analysis of theoretical stability in different scientific disciplines
• Extending the NMA towards the full realist thesis


Variation 6 Causal Effect

• Calculating causal effect in complicated network structures
• Generalizing the causal effect measure to real-valued variables and integrating it with statistical effect size measures
• Proposing a unified probabilistic theory of causal strength and causal specificity


Variation 7 Explanatory Power

• Conducting experiments on the determinants of explanatory judgments
• Relating measures of explanatory power to measures of confirmation and causal effect
• Developing a normatively convincing Bayesian account of IBE and abductive inference

Variation 8 Intertheoretic Reduction

• Checking the robustness of the analysis under different confirmation measures
• Describing intertheoretic reduction as increasing the coherence of a set of theories
• Modeling the disconfirmation of the phenomenological theory by the fundamental theory


Variation 9 Hypothesis Testing and Corroboration

• Extending the corroboration measure to more complicated statistical inference problems (nuisance parameters, hierarchical modeling, etc.)
• Conducting statistical meta-analysis with corroboration judgments
• Finding case studies for corroboration-based reasoning in the history of science


Variation 10 Simplicity

• Exploring the context-sensitivity of model selection criteria
• Working out the thesis of instrumental Bayesianism, and transferring it to other areas of statistical inference
• Comparing the role of simplicity in model selection to simplicity in causal and explanatory inference (e.g., IBE → Variations 5 and 7)


Variation 11 Objectivity

• Assessing whether Bayesian analysis is less vulnerable to replication failure than frequentist analysis (i.e., NHST)
• Building bridges between Bayesian and frequentist reasoning
• Investigating the philosophical foundations of objective Bayesian methods


Now, we proceed to describe five major research projects that fit into
the Bayesian philosophy of science research program, and that bridge
different topics discussed in this book.

First, an obvious direction into which our research program could be
extended is the integration of probabilistic, causal and counterfactual
reasoning. The analysis of learning conditionals and confirmation by old
evidence in Variations 1 and 3 showed that causal considerations often
constrain an agent’s rational degrees of belief. Moreover, the evaluation of
conditional degrees of belief proceeds counterfactually, and via the notion 
of an intervention, counterfactual considerations play an important role 
in evaluating causal relations, too. While we restrict ourselves to solving 
some concrete problems at the intersection of causal and probabilistic in- 
ference, future work should come up with an integrated theory of causal 
induction, Bayesian learning and conditional reasoning (e.g., building on 
Oaksford and Chater, 2000; Pearl, 2000; Douven, 2016; Over, 2016). Apart 
from theoretical pioneering work, we see a lot of promise in experiments 
that investigate the role of causal structure in reasoning with conditionals. 
The same holds true for experiments that predict the truth or acceptability
of a conditional by the strength of the expressed causal effect
(→ Variation 6).


Second, Bayesian confirmation is intimately related to causal and ex- 
planatory considerations (Variations 2, 6 and 7). On a theoretical level, 
this calls for an extended analysis of the relationship between measures 
of confirmation, causal effect and explanatory power, similar to what 
Schupbach (2016) did for measures of explanatory power and posterior 
probability. This should lead to a sharper demarcation of these concepts 
and to a description of the conditions under which one is conducive to the
other. On an empirical level, we propose experiments that uncover correlations
and differences in judgments of explanatory power, causal strength,
and probability, in order to reveal the determinants of explanatory judg- 
ments and to provide a more nuanced and descriptively appropriate view 
of explanatory reasoning. Since explanation is a concept which is loaded 
with causal and probabilistic connotations, such experiments strike us as 
highly valuable. Colombo et al. (2016a) and Colombo et al. (2016b) already 
made some steps in this direction, but there is still much work to be done. 
See also the survey by Sloman and Lagnado (2015). This research could 
also be related to the role of simplicity in scientific reasoning: our formal
analysis of simplicity in model selection (→ Variation 10) should, at some
point, be complemented by an empirical investigation of how simplicity
considerations affect scientific reasoning, similar to what Lombrozo (2007)
did for simplicity in explanatory judgments.


Third, the material in this book is an outstanding basis for a detailed
investigation of the scope and rationality of Inference to the Best
Explanation. Not only can one assess IBE on the basis of diverse measures
of explanatory power (→ Variation 7); it is also possible to relate IBE to
other argument patterns that we explicated in this book: the No
Alternatives Argument (NAA, → Variation 4) and the No Miracles Argument
(NMA, → Variation 5). As we argued in those Variations, both arguments
are essentially abductive in arguing that the empirical adequacy of a
theory T is the best explanation for the absence of viable alternatives
(→ NAA) and the predictive success of T (→ NMA).


Fourth, this book does not include a variation on unification and its
role in reduction and explanation (→ Variations 7 and 8), mainly because
there is not much literature on this topic from a Bayesian perspective.
Unification is traditionally regarded as an important cognitive value in 
scientific reasoning (e.g. McMullin, 1982; Douglas, 2013), as a value that 
counts for most scientists as a reason to accept a theory and to pursue it 
further. Based on the pioneering work done by Myrvold (2003, 2016) and
Schupbach (2005), it seems plausible to explicate unification by means
of confirmation-theoretic or information-theoretic models, to explain its
role in intertheoretic reductions and explanatory reasoning, and to de- 
scribe unification in important case studies, such as Bayesian cognitive 
science (Colombo and Hartmann, 2016). 


Fifth and last, Bayesian methods can provide better foundations for 
hypothesis testing in science. It has been frequently noted that the current 
method of hypothesis testing, essentially based on p-values, is not only at 
odds with the very principles of Bayesian reasoning, but also a danger 
for the reliability of scientific inquiry (e.g., Berger and Sellke, 1987; Good- 
man, 1999a; Cumming, 2012, 2014). It is therefore important to integrate 
Bayesian reasoning into hypothesis tests and to reconcile both paradigms 
(Wetzels et al., 2009; Wetzels and Wagenmakers, 2012; Lee and Wagenmak- 
ers, 2013; Morey et al., 2014, 2016). However, doing so is often far from 
straightforward due to the different motivations that feed Bayesian
confirmation theory (→ Variation 2) and hypothesis testing in the tradition of
Popper and Fisher. Variation 9 makes an attempt to quantify the degree
of corroboration of a hypothesis and Variation 6 axiomatizes various
measures of causal effect that could result from RCTs or case-control studies.
These projects need to be pursued further in order to obtain a full Bayesian
account of hypothesis testing.


This brings us, finally, to a wider perspective on Bayesian philosophy 
of science. First, this book has neglected the social dimension of science. 
We have, so far, focused on the perspective of an individual scientist (or 
a homogeneous research team) who does experiments, analyzes data and
assesses theories. Future work could link the issues covered in this book 
to questions about merging opinions and the role of experts in science (for 
survey articles, see Dietrich and List, 2016; Martini and Sprenger, 2016). 
For yet another research program in the social epistemology of science
that can be tackled by Bayesian models, consider the exploration of
epistemic landscapes and the credit reward system in science (e.g., Zollman,
2007; Weisberg and Muldoon, 2009; Heesen, 2016a,b). 


Second, we could tighten the link between Bayesian reasoning in phi- 
losophy and Bayesian reasoning in science (e.g., Bayesian statistics). In 
this book, we have only scratched the surface, but there is a fascinating 
and largely unexplored set of questions about how philosophical insights into
Bayesian reasoning and hypothesis testing should translate into practical
statistical reasoning (e.g., Gallistel, 2009; Bernardo, 2012; Sprenger, 2013b). 
The fifth research project listed above, concerned with scientific
hypothesis testing and Bayesian inference, falls into this domain. But there are
also more general methodological questions. For example, Gelman and 
Shalizi (2012, 2013) suggest that Bayesian inference is very convenient 
at the micro-level of statistical inference within a given class of models, 
but that the proper task of model testing and evaluation rather follows a 
hypothetico-deductive rationale. Moreover, there has not yet been a sys- 
tematic investigation and comparison of the philosophical foundations of 
the different objective Bayesian approaches, and the conceptions of objec- 
tivity that they endorse. Given the centrality of claims to objectivity in 
the evaluation of research findings in modern science, this strikes us as a 
highly worthwhile endeavor. 


Both projects can also join forces. The social sciences, psychology in 
particular, are undergoing a replication crisis, that is, difficulties in
reproducing findings from published experiments (e.g., Galak et al., 2012; Makel
et al., 2012; Francis, 2014; Francis et al., 2014; Open Science Collaboration, 
2015). How much can we trust the scientific method if research results are so
often unstable? Bayesian methods may provide an answer to this question:
they have been used to explain why the current publication culture
promotes unreliable findings and to point out how such biases can be cured
(e.g., Ioannidis, 2005; Ioannidis and Trikalinos, 2007). The work in this 
book, especially in Variations 2, 9 and 11, provides a starting point for
further philosophical investigation of these problems. Working along these
lines would combine Bayesian philosophy of science with the practice of 
Bayesian statistics and a social perspective on the scientific enterprise.

All in all, it should be clear by now that there is an exciting and in- 
exhaustible set of unanswered research questions in Bayesian philosophy 
of science. Therefore, we predict that the Bayesian research program will 
have a bright and fascinating future—in philosophy of science and beyond. 